CN109947896A - A kind of unstructured flow data real-time storage method of rail traffic - Google Patents


Info

Publication number
CN109947896A
Authority
CN
China
Prior art keywords
data
flow data
rail traffic
unstructured
rowkey
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910181493.6A
Other languages
Chinese (zh)
Inventor
黄滔
王刚
高杨
刘国庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Bang Sheng Technology Co Ltd
CRRC Tangshan Co Ltd
Original Assignee
Zhejiang Bang Sheng Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Bang Sheng Technology Co Ltd
Priority to CN201910181493.6A
Publication of CN109947896A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a real-time storage method for unstructured rail transit stream data. Unstructured rail transit stream data is collected into a rail transit big data processing platform built on Hadoop and its component HBase, a distributed database. An efficient retrieval scheme is constructed for the collected multi-source unstructured stream data, and the data is fed into data buffers. Once the amount of data buffered in a queue reaches a threshold, the HBase multithreaded write method or the flushCommits() method is called to write the data to HBase. The proposed HBase-based storage scheme meets the unstructured-data storage, storage speed, and storage capacity requirements of rail transit stream data. The proposed retrieval scheme, which is used to design the HBase RowKey, speeds up data queries. The invention also proposes multi-source buffers for multi-source unstructured stream data, together with a corresponding scheme for optimizing the HBase index, which improves stream data write efficiency.

Description

A real-time storage method for unstructured rail transit stream data
Technical field
The invention belongs to the field of rail transit, and in particular relates to a real-time storage method for unstructured rail transit stream data.
Background
Urban rail transit construction is currently under way on a large scale in China. Many subway, light rail, and tram lines are already in operation or under construction; construction periods are short and schedules tight, the number of stations on new lines keeps growing, and the volume of data grows with it. Rail transit involves many disciplines, including transport, civil construction, vehicles, electromechanical systems, power supply, communications, signaling, and environmental control, and each discipline generates data records numbering in the millions every day through manual entry and equipment acquisition. This data is massive, multi-source, and heterogeneous, is generated and propagated extremely quickly, and contains a great deal of useful information. In-depth data mining and analysis of this large-scale heterogeneous data can extract valuable information, raise the operating level of rail transit, strengthen evidence-based decision making, improve efficiency, reduce cost, and improve information services and safety assurance. The data includes large amounts of structured and unstructured stream data, such as image data and signals sampled continuously by various sensors or computer equipment. How to store and query large-scale heterogeneous rail transit stream data quickly and efficiently is an active research topic in the industry.
Most storage schemes used by existing rail transit data storage systems are built on traditional relational databases. Their read/write latency is high, so they cannot meet the storage speed requirements of rail transit stream data, and their horizontal scalability is poor, so they cannot keep up with the rapidly growing capacity requirements of that data. In addition, the few existing data warehouse deployments all store data using an MPP DB architecture. MPP DB has some performance advantage for complex multi-dimensional queries, but it scales poorly, offers limited concurrency, cannot store unstructured data, and cannot perform stream processing. These drawbacks mean that MPP DB is not a good solution for a rail transit stream data storage system.
Hadoop is a mainstream distributed system infrastructure. Compared with MPP DB, the HDFS it provides has a flexible structure, is easy to scale out, can store unstructured data, holds larger data volumes, and supports high concurrency and real-time processing. However, storage built directly on HDFS has no index and uses large data blocks, so operations such as exact queries and combined queries are inefficient. HDFS therefore cannot be used directly; instead, a suitable Hadoop component must be selected and a new storage scheme designed around the characteristics of heterogeneous rail transit stream data.
Summary of the invention
To overcome the shortcomings of existing storage schemes for unstructured rail transit stream data, the present invention provides a real-time storage method for unstructured rail transit stream data that addresses the following problems:
1. Conventional rail transit storage systems based on relational databases and MPP DB have high read/write latency and cannot meet the storage speed requirements of stream data. The invention addresses fast, real-time storage of large volumes of rail transit stream data.
2. Conventional rail transit storage systems based on relational databases and MPP DB scale out poorly, whereas rail transit stream data grows quickly and demands large capacity that such systems cannot provide. The invention addresses the capacity bottleneck that conventional storage schemes face as data scale grows rapidly.
3. Conventional rail transit storage systems based on relational databases and MPP DB cannot store unstructured stream data. The invention addresses the inability of conventional storage schemes to store unstructured rail transit stream data quickly and in real time.
4. Rail transit data storage schemes based on HDFS cannot meet the efficient, fast query requirements of rail transit applications. The invention addresses the low query efficiency of HDFS-based rail transit data storage schemes.
The purpose of the invention is achieved through the following technical solution: a real-time storage method for unstructured rail transit stream data, comprising the following steps:
Step 1: Collect the unstructured rail transit stream data into a rail transit big data processing platform built on Hadoop and its component HBase, a distributed database.
Step 2: Construct an efficient retrieval scheme for the collected multi-source unstructured stream data.
Step 2.1: Station stream data retrieval scheme, comprising the following sub-steps:
Step 2.1.1: For station stream data, use its network-wide unique index as the HBase RowKey; the combination of RowKey and timestamp locates a specific data item.
Step 2.1.2: Pre-partition the station stream data: before data is written to HBase, create N empty regions in advance.
Step 2.1.3: Optimize the station stream data index structure according to the pre-partitioning strategy: apply a hash operation to the original index to obtain a prefix, Prefix:
Prefix = HashCode(RowKey) % N
Update the RowKey according to Prefix.
Step 2.2: Device stream data retrieval scheme, comprising the following sub-steps:
Step 2.2.1: For device stream data, use its network-wide unique index as the HBase RowKey; the combination of RowKey and timestamp locates a specific data item.
Step 2.2.2: Pre-partition the device stream data: before data is written to HBase, create M empty regions in advance.
Step 2.2.3: Optimize the device stream data index structure according to the pre-partitioning strategy: apply a hash operation to the original index to obtain a prefix, Prefix:
Prefix = HashCode(RowKey) % M
Update the RowKey according to Prefix.
Step 3: Feed the collected multi-source unstructured stream data into data buffers.
Step 4: Once the amount of data buffered in a queue from step 3 reaches a threshold, call the HBase multithreaded write method or the flushCommits() method to write the data to HBase.
Further, the unstructured rail transit stream data includes image files, audio files, and video files.
Further, in step 2.1.1, the RowKey structure is designed as follows:
Station | Line | Device type | Device ID | Subsystem | Data type
Further, in step 2.1.3, the RowKey is updated to the following form:
Prefix | Station | Line | Device type | Device ID | Subsystem | Data type
Further, in step 2.2.1, the RowKey structure is designed as follows:
Device type | Device ID | Line | Subsystem | Data type
Further, in steps 2.1.1 and 2.2.1, the combined RowKey and timestamp index lets a user locate a specific data item precisely and also query data by time range, improving query efficiency.
Further, the number M of empty regions created in advance in step 2.2.2 is greater than the number N of empty regions created in advance in step 2.1.2.
Further, in step 2.2.3, the RowKey is updated to the following form:
Prefix | Device type | Device ID | Line | Subsystem | Data type
Further, step 3 is specifically: construct multiple buffer queues for each data type; hash the timestamp of a data item and take the result modulo M to obtain the buffer queue number QueueID, where M is the number of buffer queues for that item's data type; and place the item in the corresponding buffer queue according to QueueID.
QueueID = HashCode(timestamp) % M
Further, in step 3, 3 buffer queues are constructed for each data type.
The beneficial effects of the invention are:
1. The invention proposes an HBase-based storage scheme for unstructured rail transit stream data that meets the unstructured-data storage, storage speed, and storage capacity requirements of rail transit stream data.
2. The invention proposes an efficient retrieval scheme for unstructured rail transit stream data (steps 2.1.1 and 2.2.1); designing the HBase RowKey according to this scheme speeds up data queries.
3. The invention proposes multi-source buffers for multi-source unstructured stream data and a corresponding scheme for optimizing the HBase index (steps 2.1.3 and 2.2.3), improving stream data write efficiency.
Brief description of the drawings
Fig. 1 is the overall structure diagram;
Fig. 2 is the structure diagram of the multi-source buffers.
Detailed description of the embodiments
In the following detailed description, reference is made to the accompanying drawings, which form a part hereof and which show, by way of illustration, specific embodiments in which the invention may be practiced. It should be understood that other embodiments may be used and that structural or logical changes may be made without departing from the scope of the invention. For example, features shown or described for one embodiment may be used in, or combined with, other embodiments to produce a further embodiment. The invention is intended to include such modifications and variations. The embodiments are described using specific language, which should not be construed as limiting the scope of the appended claims. The drawings are not drawn to scale and are for illustration only.
The real-time storage method for unstructured rail transit stream data proposed by the invention stores stream data in the HBase distributed database. The specific steps are as follows:
Step 1: Unstructured rail transit stream data, including image files (Image, such as pictures posted by passengers), audio files (Audio, such as recordings produced by recording equipment), and video files (Video, such as surveillance video and 3D animation), is collected into a rail transit big data processing platform built on Hadoop and its component HBase, a distributed database.
Step 2: Construct an efficient retrieval scheme for the collected multi-source unstructured stream data. The detailed steps are as follows:
Step 2.1: Station stream data retrieval scheme. Station stream data is stream data associated with a station, such as station surveillance video. The detailed steps are as follows:
Step 2.1.1: For station stream data, use its network-wide unique index as the HBase RowKey; the combination of RowKey and timestamp locates a specific data item. The RowKey structure is designed as follows:
Station | Line | Device type | Device ID | Subsystem | Data type
The RowKey and timestamp can be combined to form an index of the following form:
With the combined RowKey and timestamp index, a user can locate a specific data item precisely and can also query data by time range, improving query efficiency.
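The combined index can be illustrated with a minimal sketch. The source does not reproduce the exact index layout, so everything here is an assumption made for illustration: the `|` separator, the field values, and the choice to append a zero-padded millisecond timestamp to the key (HBase cell timestamps would be another way to carry the time component).

```python
from typing import NamedTuple, Tuple

SEP = "|"  # hypothetical field separator; not specified in the source


class StationKey(NamedTuple):
    """Fields of the station RowKey, in the order listed in step 2.1.1."""
    station: str
    line: str
    device_type: str
    device_id: str
    subsystem: str
    data_type: str


def row_key(k: StationKey) -> str:
    """Join the six fields in RowKey order."""
    return SEP.join(k)


def indexed_key(k: StationKey, ts_ms: int) -> str:
    """RowKey plus a zero-padded timestamp, so string order follows time order."""
    return f"{row_key(k)}{SEP}{ts_ms:013d}"


def time_range(k: StationKey, start_ms: int, end_ms: int) -> Tuple[str, str]:
    """Start key (inclusive) and stop key (exclusive) for a time-range scan."""
    return indexed_key(k, start_ms), indexed_key(k, end_ms)
```

Because the timestamp is zero-padded to a fixed width, every record of one source between `start_ms` and `end_ms` sorts between the two keys returned by `time_range`, which is what makes a time-range query a single contiguous scan.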
Step 2.1.2: Pre-partition the station stream data: before data is written to HBase, create N empty regions in advance (preferably 16). By default, HBase automatically creates a single region when a table is created. When data is written, all HBase clients write to that one region, which is split only after it grows large enough. This raises the load on a single node and lowers write efficiency; pre-partitioning is proposed to solve this problem.
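As a rough sketch of what pre-partitioning amounts to, the boundary keys for N regions can be generated from the possible prefix values. This assumes, hypothetically, that the prefix is stored as a two-digit zero-padded string at the start of the key; the helper names are illustrative and no HBase API is called.

```python
from bisect import bisect_right
from typing import List


def split_keys(num_regions: int, width: int = 2) -> List[bytes]:
    """Return the num_regions - 1 boundary keys that divide the key space
    into num_regions ranges, one per zero-padded prefix value."""
    if num_regions < 1:
        raise ValueError("num_regions must be at least 1")
    return [f"{i:0{width}d}".encode("ascii") for i in range(1, num_regions)]


def region_for(key: bytes, splits: List[bytes]) -> int:
    """Index of the range [split[i-1], split[i]) that contains key,
    comparing keys in byte order."""
    return bisect_right(splits, key)
```

With `split_keys(16)` the boundaries are `b"01"` through `b"15"`, so a key that starts with `07` falls into range 7. A list like this is the kind of input a table-creation call that accepts split points would take.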
Step 2.1.3: Optimize the station stream data index structure according to the pre-partitioning strategy: apply a hash operation to the original index to obtain a prefix, Prefix.
Prefix = HashCode(RowKey) % N
Here RowKey is as described in step 2.1.1, and N is the number of pre-partitioned regions in step 2.1.2. The RowKey is then updated to the following form:
Prefix | Station | Line | Device type | Device ID | Subsystem | Data type
The above steps eliminate the HBase hot spot problem. The hot spot problem arises because stream data is inserted into a single region in chronological order, and writes move on to another region only after that region reaches a certain threshold; this greatly reduces write efficiency. Introducing a prefix through the hash-and-modulo operation spreads consecutive stream data more evenly across the pre-partitioned regions, improving write efficiency.
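The prefix computation can be sketched as follows. The source does not say which hash function HashCode denotes, so CRC32 is used here purely as a deterministic stand-in; the two-digit padding and the `|` separator are likewise assumptions.

```python
import zlib

N_REGIONS = 16  # preferred value of N from step 2.1.2


def prefix(row_key: str, n: int = N_REGIONS) -> int:
    """Prefix = HashCode(RowKey) % N, with CRC32 standing in for HashCode."""
    return zlib.crc32(row_key.encode("utf-8")) % n


def salted_key(row_key: str, n: int = N_REGIONS) -> str:
    """Prepend the zero-padded prefix to the original RowKey."""
    return f"{prefix(row_key, n):02d}|{row_key}"
```

CRC32 always returns a non-negative value, so the remainder lies in 0..N-1. A hash that can return negative values (Java's `hashCode()`, for instance) would need its result made non-negative before the modulo. Because the prefix depends only on the RowKey and not on the timestamp, all records from one source keep the same prefix, while different sources are spread across regions.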
Step 2.2: Device stream data retrieval scheme. Device stream data is data associated with the various devices and trains, such as train surveillance video and device recordings. The detailed steps are as follows (similar to step 2.1):
Step 2.2.1: For device stream data, use its network-wide unique index as the HBase RowKey; the combination of RowKey and timestamp locates a specific data item. The RowKey structure is designed as follows:
Device type | Device ID | Line | Subsystem | Data type
The RowKey and timestamp can be combined to form an index of the following form:
With the combined RowKey and timestamp index, a user can locate a specific data item precisely and can also query data by time range, improving query efficiency.
Step 2.2.2: Pre-partition the device stream data: before data is written to HBase, create M empty regions in advance (preferably 32). Device stream data uses more pre-partitions than station stream data because there are more kinds of devices and the data volume is larger.
Step 2.2.3: Optimize the device stream data index structure according to the pre-partitioning strategy: apply a hash operation to the original index to obtain a prefix, Prefix.
Prefix = HashCode(RowKey) % M
Here RowKey is the RowKey described in step 2.2.1, and M is the number of pre-partitioned regions in step 2.2.2. The RowKey is then updated to the following form:
Prefix | Device type | Device ID | Line | Subsystem | Data type
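One way to check whether the prefixed device RowKeys actually balance across the M regions is to count how many keys land on each prefix. The sketch below assumes CRC32 as a stand-in for HashCode, a `|` separator, and made-up device identifiers; none of these come from the source.

```python
import zlib
from collections import Counter
from typing import Iterable

M_REGIONS = 32  # preferred number of pre-partitioned regions for device data


def device_row_key(device_type: str, device_id: str, line: str,
                   subsystem: str, data_type: str) -> str:
    """Device RowKey in the field order given in step 2.2.1."""
    return "|".join((device_type, device_id, line, subsystem, data_type))


def salted(row_key: str, m: int = M_REGIONS) -> str:
    """Prefix the RowKey with hash(RowKey) % m as two digits."""
    p = zlib.crc32(row_key.encode("utf-8")) % m
    return f"{p:02d}|{row_key}"


def region_histogram(row_keys: Iterable[str], m: int = M_REGIONS) -> Counter:
    """Count how many keys fall on each prefix value."""
    return Counter(int(salted(k, m)[:2]) for k in row_keys)
```

Running `region_histogram` over a representative list of device keys gives a quick view of skew; a heavily uneven histogram would suggest that the chosen hash or the value of M needs revisiting for that key population.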
Step 3: Feed the collected multi-source unstructured stream data into data buffers, so that different types of stream data are processed in real time and batch write speed improves. The detailed steps are as follows:
Construct multiple buffer queues for each data type (preferably 3 buffer queues per data type). Hash the timestamp of a data item and take the result modulo M to obtain the buffer queue number QueueID (M is the number of buffer queues for that item's data type), then place the item in the corresponding buffer queue according to QueueID.
QueueID = HashCode(timestamp) % M
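The queue routing above can be sketched in a few lines. The class name `MultiSourceBuffer`, the use of CRC32 over the decimal timestamp as the hash, and the in-memory `deque` queues are assumptions for illustration; the source specifies only the formula and the preferred count of 3 queues per data type.

```python
import zlib
from collections import deque
from typing import Deque, Dict, Iterable, List, Tuple

Item = Tuple[int, bytes]  # (timestamp in ms, payload)


class MultiSourceBuffer:
    """One group of buffer queues per data type (hypothetical helper)."""

    def __init__(self, data_types: Iterable[str], queues_per_type: int = 3):
        self.queues: Dict[str, List[Deque[Item]]] = {
            t: [deque() for _ in range(queues_per_type)] for t in data_types
        }

    def queue_id(self, data_type: str, ts_ms: int) -> int:
        """QueueID = HashCode(timestamp) % M, where M is the number of
        queues for this data type; CRC32 stands in for HashCode."""
        m = len(self.queues[data_type])
        return zlib.crc32(str(ts_ms).encode("ascii")) % m

    def put(self, data_type: str, ts_ms: int, payload: bytes) -> int:
        """Append the item to its queue and return the chosen QueueID."""
        qid = self.queue_id(data_type, ts_ms)
        self.queues[data_type][qid].append((ts_ms, payload))
        return qid
```

For example, `MultiSourceBuffer(["image", "audio", "video"])` creates nine queues in total, and `put("video", ts, frame)` places the frame on one of the three video queues according to its timestamp.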
Step 4: Once the amount of data buffered in a queue from step 3 reaches a threshold, call the HBase multithreaded write method or the flushCommits() method to write the data to HBase.
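The threshold-triggered write can be sketched independently of any HBase client. In the sketch below, `ThresholdWriter` and its `sink` callable are hypothetical names: `sink` stands in for whichever client call performs the batch write, and a thread pool approximates the multithreaded write path. No HBase API is invoked.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List, Optional, Tuple

Record = Tuple[str, bytes]  # (row key, value)


class ThresholdWriter:
    """Buffer records and hand full batches to worker threads."""

    def __init__(self, sink: Callable[[List[Record]], None],
                 threshold: int, workers: int = 4):
        self._sink = sink
        self._threshold = threshold
        self._pending: List[Record] = []
        self._futures: List[Future] = []
        self._lock = threading.Lock()
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def _submit(self, batch: List[Record]) -> None:
        future = self._pool.submit(self._sink, batch)
        with self._lock:
            self._futures.append(future)

    def add(self, row_key: str, value: bytes) -> None:
        """Buffer one record; flush a batch once the threshold is reached."""
        batch: Optional[List[Record]] = None
        with self._lock:
            self._pending.append((row_key, value))
            if len(self._pending) >= self._threshold:
                batch, self._pending = self._pending, []
        if batch is not None:
            self._submit(batch)

    def close(self) -> None:
        """Flush the remainder, wait for all writes, and stop the workers."""
        with self._lock:
            batch, self._pending = self._pending, []
        if batch:
            self._submit(batch)
        with self._lock:
            futures = list(self._futures)
        for f in futures:
            f.result()  # re-raises any exception from the sink
        self._pool.shutdown(wait=True)
```

Swapping the batch out under the lock and writing it outside the lock keeps producers from blocking on slow writes. The threshold trades latency for throughput: larger batches mean fewer round trips but a longer wait before data becomes visible.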
The foregoing is merely a preferred embodiment of the invention and is not intended to limit it. Those skilled in the art may make various modifications and changes to the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (10)

1. A real-time storage method for unstructured rail transit stream data, characterized in that the method comprises the following steps:
Step 1: Collect the unstructured rail transit stream data into a rail transit big data processing platform built on Hadoop and its component HBase, a distributed database.
Step 2: Construct an efficient retrieval scheme for the collected multi-source unstructured stream data.
Step 2.1: Station stream data retrieval scheme, comprising the following sub-steps:
Step 2.1.1: For station stream data, use its network-wide unique index as the HBase RowKey; the combination of RowKey and timestamp locates a specific data item.
Step 2.1.2: Pre-partition the station stream data: before data is written to HBase, create N empty regions in advance.
Step 2.1.3: Optimize the station stream data index structure according to the pre-partitioning strategy: apply a hash operation to the original index to obtain a prefix, Prefix = HashCode(RowKey) % N, and update the RowKey according to Prefix.
Step 2.2: Device stream data retrieval scheme, comprising the following sub-steps:
Step 2.2.1: For device stream data, use its network-wide unique index as the HBase RowKey; the combination of RowKey and timestamp locates a specific data item.
Step 2.2.2: Pre-partition the device stream data: before data is written to HBase, create M empty regions in advance.
Step 2.2.3: Optimize the device stream data index structure according to the pre-partitioning strategy: apply a hash operation to the original index to obtain a prefix, Prefix = HashCode(RowKey) % M, and update the RowKey according to Prefix.
Step 3: Feed the collected multi-source unstructured stream data into data buffers.
Step 4: Once the amount of data buffered in a queue from step 3 reaches a threshold, call the HBase multithreaded write method or the flushCommits() method to write the data to HBase.
2. The real-time storage method for unstructured rail transit stream data according to claim 1, characterized in that, in step 1, the unstructured rail transit stream data includes image files, audio files, and video files.
3. The real-time storage method for unstructured rail transit stream data according to claim 1, characterized in that, in step 2.1.1, the RowKey structure is designed as follows:
Station | Line | Device type | Device ID | Subsystem | Data type
4. The real-time storage method for unstructured rail transit stream data according to claim 1, characterized in that, in step 2.1.3, the RowKey is updated to the following form:
Prefix | Station | Line | Device type | Device ID | Subsystem | Data type
5. The real-time storage method for unstructured rail transit stream data according to claim 1, characterized in that, in step 2.2.1, the RowKey structure is designed as follows:
Device type | Device ID | Line | Subsystem | Data type
6. The real-time storage method for unstructured rail transit stream data according to claim 1, characterized in that, in steps 2.1.1 and 2.2.1, the combined RowKey and timestamp index lets a user locate a specific data item precisely and also query data by time range, improving query efficiency.
7. The real-time storage method for unstructured rail transit stream data according to claim 1, characterized in that the number M of empty regions created in advance in step 2.2.2 is greater than the number N of empty regions created in advance in step 2.1.2.
8. The real-time storage method for unstructured rail transit stream data according to claim 1, characterized in that, in step 2.2.3, the RowKey is updated to the following form:
Prefix | Device type | Device ID | Line | Subsystem | Data type
9. The real-time storage method for unstructured rail transit stream data according to claim 1, characterized in that step 3 is specifically: construct multiple buffer queues for each data type; hash the timestamp of a data item and take the result modulo M to obtain the buffer queue number QueueID, where M is the number of buffer queues for that item's data type; and place the item in the corresponding buffer queue according to QueueID.
QueueID = HashCode(timestamp) % M
10. The real-time storage method for unstructured rail transit stream data according to claim 9, characterized in that, in step 3, 3 buffer queues are constructed for each data type.
CN201910181493.6A 2019-03-11 2019-03-11 A kind of unstructured flow data real-time storage method of rail traffic Pending CN109947896A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910181493.6A CN109947896A (en) 2019-03-11 2019-03-11 A kind of unstructured flow data real-time storage method of rail traffic

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910181493.6A CN109947896A (en) 2019-03-11 2019-03-11 A kind of unstructured flow data real-time storage method of rail traffic

Publications (1)

Publication Number Publication Date
CN109947896A true CN109947896A (en) 2019-06-28

Family

ID=67009532

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910181493.6A Pending CN109947896A (en) 2019-03-11 2019-03-11 A kind of unstructured flow data real-time storage method of rail traffic

Country Status (1)

Country Link
CN (1) CN109947896A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104850572A (en) * 2014-11-18 2015-08-19 中兴通讯股份有限公司 HBase non-primary key index building and inquiring method and system
CN107943890A (en) * 2017-11-16 2018-04-20 武汉虹旭信息技术有限责任公司 Mobile Internet mass data processing system and method based on HBase
CN108009290A (en) * 2017-12-25 2018-05-08 国电南瑞科技股份有限公司 A kind of data modeling and storage method of track traffic command centre gauze big data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
尹鹏飞: "基于Hbase的健康监测大数据平台性能优化研究与应用", 《中国优秀硕士学位论文全文数据库》 *
瞿龙俊: "基于Hbase的交通流数据实时存储与查询优化方案的设计与优化", 《中国优秀硕士学位论文全文数据库》 *

Similar Documents

Publication Publication Date Title
CN106934014B (en) Hadoop-based network data mining and analyzing platform and method thereof
CN106649656B (en) Database-oriented space-time trajectory big data storage method
CN102521405B (en) Massive structured data storage and query methods and systems supporting high-speed loading
CN104317966B (en) A kind of dynamic index method inquired about for electric power big data Rapid Combination
CN102521406B (en) Distributed query method and system for complex task of querying massive structured data
CN103390038B (en) A kind of method of structure based on HBase and retrieval increment index
CN103491187B (en) A kind of big data united analysis processing method based on cloud computing
CN103631909B (en) System and method for combined processing of large-scale structured and unstructured data
CN103020204B (en) A kind of method and its system carrying out multi-dimensional interval query to distributed sequence list
CN107590250A (en) A kind of space-time orbit generation method and device
WO2013182054A1 (en) Memory retrieval, real time retrieval system and method, and computer storage medium
CN102332030A (en) Data storing, managing and inquiring method and system for distributed key-value storage system
CN111460024A (en) Real-time service system based on Elasticissearch
CN106611053A (en) Data cleaning and indexing method
CN103970902A (en) Method and system for reliable and instant retrieval on situation of large quantities of data
CN103020255A (en) Hierarchical storage method and hierarchical storage device
CN108509437A (en) A kind of ElasticSearch inquiries accelerated method
CN103744913A (en) Database retrieval method based on search engine technology
CN106599040A (en) Layered indexing method and search method for cloud storage
Gomes et al. An infrastructure model for smart cities based on big data
CN108009290A (en) A kind of data modeling and storage method of track traffic command centre gauze big data
CN103226608A (en) Parallel file searching method based on folder-level telescopic Bloom Filter bit diagram
CN105574188A (en) Method and system for managing data in different dimensions and at different layers
WO2022143017A1 (en) Method and apparatus for constructing traffic data warehouse, storage medium, and terminal
CN103034650A (en) System and method for processing data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20191231

Address after: 310012 Xihu District, Hangzhou, West Gate Road, No., Paradise Software Park, building D, block 17, block ABCD, 3

Applicant after: Zhejiang Bang Sheng Technology Co., Ltd.

Applicant after: CRRC TANGSHAN CO., LTD.

Address before: 310012 Xihu District, Hangzhou, West Gate Road, No., Paradise Software Park, building D, block 17, block ABCD, 3

Applicant before: Zhejiang Bang Sheng Technology Co., Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20190628