CN107562946A

CN107562946A - Method for creating an index table in a big data system

Info

Publication number: CN107562946A
Application number: CN201710879944.4A
Authority: CN
Inventors: 黄礼成; 张蓉; 姜雪; 耿鹏舒
Original assignee: Nanjing Harlu Mdt Infotech Ltd
Current assignee: Nanjing Harlu Mdt Infotech Ltd
Priority date: 2017-09-26
Filing date: 2017-09-26
Publication date: 2018-01-09

Abstract

The present invention provides a method for creating an index table in a big data system. The method includes: (1) metadata storage based on a data dictionary, which uses dictionary encoding to speed up computation, with data converted into a user-readable form only when results are returned to the user; (2) multidimensional data clustering: data is reorganized along multiple dimensions at storage time so that it is "more cohesive in multidimensional space"; (3) an indexed columnar file structure: indexes at multiple levels are designed for several classes of scenarios and incorporate some characteristics of search, including a multidimensional index across files, a multidimensional index within a file, a minmax index for each column, and an inverted index within each column; (4) column groups: the structure is columnar overall. The invention makes it easier to obtain useful information quickly from massive historical and real-time data.

Description

A method for creating an index table in a big data system

Technical field:

The present invention relates to a method for creating an index table in a big data system, and belongs to the field of Internet technology.

Background:

With the explosive growth of Internet data, quickly obtaining useful information from massive historical and real-time data is becoming increasingly challenging. Search is one of the most efficient ways to obtain information, and is therefore a basic, standard feature of all kinds of websites and applications. Developers who want to add search to their own products typically build a search service on an open-source search system (such as ElasticSearch, Solr, or Sphinx). However, apart from purchasing or hosting servers, the path from learning the system, building the service, and customizing its functions to putting the service into production usually takes a long time.

Summary of the invention:

In view of the above problem, the purpose of the present invention is to provide a method for creating an index table in a big data system that makes it easier to obtain useful information quickly from massive historical and real-time data.

The above purpose is achieved by the following technical solution:

A method for creating an index table in a big data system, the method including:

(1) Metadata storage based on a data dictionary: dictionary encoding is used to speed up computation. It lets the processing/query engine operate directly on the encoded data without converting it; data is converted into a user-readable form only when results are returned to the user;

(2) Multidimensional data clustering: data is reorganized along multiple dimensions at storage time so that it is "more cohesive in multidimensional space", which yields a better compression ratio in storage and more efficient data filtering in computation;

(3) Indexed columnar file structure: indexes at multiple levels are designed for several classes of scenarios and incorporate some characteristics of search. They include a multidimensional index across files, a multidimensional index within a file, a minmax index for each column, and an inverted index within each column. Indexes are stored together with the data files: part of the index is itself data, and the other part is stored in the metadata structure of the file;

(4) Column groups: the structure is columnar overall, but the user can store as a column group those fields that are seldom used as filter conditions yet need to be returned as part of the result set. After encoding, these fields are stored row-wise to improve query performance.
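Item (1) above can be illustrated with a minimal sketch. The patent does not give an implementation, so the class name `DictionaryColumn`, its methods, and the sample values are hypothetical; the sketch only shows the general idea of filtering on integer codes and decoding values when results are returned.

```python
from typing import Dict, List


class DictionaryColumn:
    """A column held as integer codes plus a dictionary (hypothetical layout)."""

    def __init__(self, values: List[str]) -> None:
        # Distinct values, sorted, form the dictionary; a value's code is its position.
        self.dictionary: List[str] = sorted(set(values))
        self.code_of: Dict[str, int] = {v: i for i, v in enumerate(self.dictionary)}
        self.codes: List[int] = [self.code_of[v] for v in values]

    def filter_equals(self, value: str) -> List[int]:
        """Return matching row ids by comparing integer codes only."""
        code = self.code_of.get(value)
        if code is None:
            return []  # value not in the dictionary, so no row can match
        return [row for row, c in enumerate(self.codes) if c == code]

    def decode(self, rows: List[int]) -> List[str]:
        """Convert codes back to readable values, only for the rows being returned."""
        return [self.dictionary[self.codes[r]] for r in rows]


city = DictionaryColumn(["Nanjing", "Beijing", "Nanjing", "Shanghai"])
rows = city.filter_equals("Nanjing")   # works on codes, no string comparison per row
print(rows, city.decode(rows))         # decoding happens only at result time
```

In this sketch the predicate is translated to a code once, and the scan touches only integers; the dictionary lookup back to strings is deferred until the result set is known.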

Beneficial effects:

The present invention makes it easier to obtain useful information quickly from massive historical and real-time data.

Detailed description:

Embodiment 1:

The method for creating an index table in a big data system according to this embodiment includes:

(1) Metadata storage based on a data dictionary: the focus is on optimizing how data is organized, with the ultimate aim of improving I/O performance and computing performance through that organization. Global dictionary encoding is used to speed up computation; it lets the processing/query engine operate directly on the encoded data without converting it. Data is converted into a user-readable form only when results are returned to the user.

(2) Multidimensional data clustering: data is reorganized along multiple dimensions at storage time so that it is "more cohesive in multidimensional space", which yields a better compression ratio in storage and more efficient data filtering in computation.

(3) Indexed columnar file structure: indexes at multiple levels are designed for several classes of scenarios and incorporate some characteristics of search. They include a multidimensional index across files, a multidimensional index within a file, a minmax index for each column, an inverted index within each column, and so on. In addition, to suit the storage characteristics of HDFS, indexes are stored together with the data files: part of the index is itself data, and the other part is stored in the metadata structure of the file, so both benefit from the localized access that HDFS provides.

(4) Column groups: the structure is columnar overall. Compared with row-wise storage, however, a columnar structure incurs a high cost for reassembling records when answering detail-data queries. To improve detail-query performance, storage by column group is therefore supported: the user can store as a column group those fields that are seldom used as filter conditions yet need to be returned as part of the result set. After encoding, these fields are stored row-wise to improve query performance.
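Items (2) and (3) above can be sketched together. The patent does not specify how data is reorganized or how the indexes are laid out, so this sketch makes two labeled assumptions: clustering is approximated by a plain lexicographic sort on two dimensions, and the per-column minmax index is kept as one (min, max) pair per fixed-size block. The schema and all function names are hypothetical.

```python
from typing import List, Tuple

Row = Tuple[int, int, str]  # (region_id, day, payload): hypothetical schema


def build_blocks(rows: List[Row], block_size: int) -> List[List[Row]]:
    """Sort rows by two dimensions, then cut them into fixed-size blocks.

    Assumption: a lexicographic sort stands in for the patent's
    unspecified multidimensional reorganization.
    """
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    return [ordered[i:i + block_size] for i in range(0, len(ordered), block_size)]


def minmax_index(blocks: List[List[Row]], col: int) -> List[Tuple[int, int]]:
    """Record the (min, max) of one column for every block."""
    return [(min(r[col] for r in b), max(r[col] for r in b)) for b in blocks]


def scan(blocks: List[List[Row]], index: List[Tuple[int, int]],
         col: int, value: int) -> Tuple[List[Row], int]:
    """Read only blocks whose [min, max] range can contain the value."""
    hits: List[Row] = []
    blocks_read = 0
    for block, (lo, hi) in zip(blocks, index):
        if lo <= value <= hi:          # otherwise the block is skipped unread
            blocks_read += 1
            hits.extend(r for r in block if r[col] == value)
    return hits, blocks_read


data = [(3, 1, "a"), (1, 2, "b"), (2, 1, "c"), (1, 1, "d"), (3, 2, "e"), (2, 2, "f")]
blocks = build_blocks(data, block_size=2)
index = minmax_index(blocks, col=0)
hits, blocks_read = scan(blocks, index, col=0, value=2)
print(index, hits, blocks_read)
```

Because the rows are clustered before being cut into blocks, each block covers a narrow range of `region_id`, so a filter on that column can skip most blocks by consulting the index alone. Without the sort, the same index would have wide ranges and prune little.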

The technical means disclosed in the present solution are not limited to those disclosed above; they also include technical solutions formed by equivalent substitution of the above technical features. Matters not addressed in the present invention belong to the common knowledge of those skilled in the art.

Claims

1. A method for creating an index table in a big data system, characterized in that the method includes: