CN107562946A - Method for creating an index table in a big data system - Google Patents

Method for creating an index table in a big data system

Info

Publication number
CN107562946A
Authority
CN
China
Prior art keywords
data
file
index
row
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710879944.4A
Other languages
Chinese (zh)
Inventor
黄礼成
张蓉
姜雪
耿鹏舒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Harlu Mdt Infotech Ltd
Original Assignee
Nanjing Harlu Mdt Infotech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Harlu Mdt Infotech Ltd
Priority to CN201710879944.4A
Publication of CN107562946A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a method for creating an index table in a big data system. The method comprises: (1) data-dictionary-based metadata storage: dictionary encoding is used to speed up computation, and data is converted into a user-readable form only when results are returned to the user; (2) multidimensional data clustering: data is reorganized along multiple dimensions at storage time so that it is more cohesive in the multidimensional space; (3) an indexed columnar file structure: multi-level indexes are designed for multiple classes of scenarios and incorporate some search features, including a multidimensional index across files, a multidimensional index within a file, a min-max index for each column, and an inverted index within a column; (4) column groups: the structure is columnar storage on the whole. The invention makes it easier to quickly obtain useful information from massive historical and real-time data.

Description

Method for creating an index table in a big data system
Technical field:
The present invention relates to a method for creating an index table in a big data system, and belongs to the field of Internet technology.
Background:
With the explosive growth of Internet data, quickly obtaining useful information from massive historical and real-time data has become increasingly challenging. Search is one of the most efficient ways to obtain information, and is therefore a basic standard feature of all kinds of websites and applications. Developers who want to add search to their own products usually build a search service on top of an open-source search system (such as ElasticSearch, Solr, or Sphinx). However, apart from purchasing hosts or hosted servers, the work of getting familiar with the system, building the service, customizing features, and then bringing the service online usually takes a long time.
Summary of the invention:
In view of the above problem, the purpose of the present invention is to provide a method for creating an index table in a big data system that makes it easier to quickly obtain useful information from massive historical and real-time data.
The above purpose is achieved by the following technical scheme:
A method for creating an index table in a big data system, the method comprising:
(1) Data-dictionary-based metadata storage: dictionary encoding is used to speed up computation; it allows the processing/query engine to operate directly on the encoded data without converting it, and data is converted into a user-readable form only when results are returned to the user;
(2) Multidimensional data clustering: data is reorganized along multiple dimensions at storage time so that it is "more cohesive in the multidimensional space", which yields a better compression ratio in storage and better data filtering efficiency in computation;
(3) Indexed columnar file structure: multi-level indexes are designed for multiple classes of scenarios and incorporate some search features, including a multidimensional index across files, a multidimensional index within a file, a min-max index for each column, and an inverted index within a column; the indexes are stored together with the data files, where part of the index is the data itself and the other part is stored in the metadata structure of the file;
(4) Column groups: the structure is columnar storage on the whole; the user can store as a column group certain fields that are not often used as filter conditions but need to be returned in the result set, and after encoding these fields are stored in row-store form to improve query performance.
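The dictionary encoding in item (1) can be illustrated with a minimal Python sketch. It is not the patented implementation: the class name `GlobalDictionary`, the sample city values, and the choice of consecutive integer codes are all assumptions made for illustration. The sketch shows a filter evaluated on integer codes, with decoding deferred until the result is returned.

```python
from typing import Dict, List


class GlobalDictionary:
    """Hypothetical helper: maps each distinct value to a small integer code."""

    def __init__(self) -> None:
        self.code_of: Dict[str, int] = {}
        self.value_of: List[str] = []

    def encode(self, value: str) -> int:
        # Assign the next free code the first time a value is seen.
        code = self.code_of.get(value)
        if code is None:
            code = len(self.value_of)
            self.code_of[value] = code
            self.value_of.append(value)
        return code

    def decode(self, code: int) -> str:
        return self.value_of[code]


# Assumed sample column; the values are illustrative only.
cities = ["Nanjing", "Shanghai", "Nanjing", "Beijing", "Nanjing"]

dictionary = GlobalDictionary()
encoded = [dictionary.encode(c) for c in cities]

# The predicate is translated to a code once, then the scan compares integers.
target = dictionary.code_of["Nanjing"]
matching_rows = [i for i, code in enumerate(encoded) if code == target]

# Decoding happens only for the rows that are returned to the user.
result = [dictionary.decode(encoded[i]) for i in matching_rows]
print(encoded, matching_rows, result)
```

Comparing integer codes instead of strings is what lets the engine work on encoded data directly; only the final result set pays the decoding cost.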
Beneficial effects:
The present invention makes it easier to quickly obtain useful information from massive historical and real-time data.
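One way to picture the multidimensional clustering of item (2) is to sort rows by several dimensions before they are written. The sketch below is an assumption-laden illustration, not the method of the invention: the text does not say how data is reorganized, so a plain lexicographic sort is used here, and the rows and the helper `run_count` are hypothetical. It counts runs of equal values in one column as a rough proxy for how well that column would compress with run-length encoding.

```python
from itertools import groupby
from typing import List, Tuple

Row = Tuple[str, str, int]  # (city, device, value); assumed schema


def run_count(values: List[str]) -> int:
    """Number of runs of consecutive equal values (fewer runs compress better)."""
    return sum(1 for _ in groupby(values))


# Assumed sample rows in arrival order.
rows: List[Row] = [
    ("Nanjing", "phone", 3),
    ("Beijing", "tablet", 7),
    ("Nanjing", "tablet", 1),
    ("Beijing", "phone", 4),
    ("Nanjing", "phone", 9),
    ("Beijing", "phone", 2),
]

# Reorganize by two dimensions: city first, then device.
clustered = sorted(rows, key=lambda r: (r[0], r[1]))

runs_before = run_count([r[0] for r in rows])
runs_after = run_count([r[0] for r in clustered])
print(runs_before, runs_after)
```

After sorting, equal city values sit next to each other, so the leading dimension has fewer runs and each stored block covers a narrower range of values, which is also what makes later filtering cheaper.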
Embodiments:
Embodiment 1:
The method for creating an index table in a big data system according to this embodiment comprises:
(1) Data-dictionary-based metadata storage: the focus is on optimizing data organization, whose ultimate purpose is to improve IO performance and compute performance. Global dictionary encoding is used to speed up computation; it allows the processing/query engine to operate directly on the encoded data without converting it. Data is converted into a user-readable form only when results are returned to the user.
(2) Multidimensional data clustering: data is reorganized along multiple dimensions at storage time so that it is "more cohesive in the multidimensional space", which yields a better compression ratio in storage and better data filtering efficiency in computation.
(3) Indexed columnar file structure: multi-level indexes are designed for multiple classes of scenarios and incorporate some search features, including a multidimensional index across files, a multidimensional index within a file, a min-max index for each column, an inverted index within a column, and so on. In addition, to suit the storage characteristics of HDFS, the indexes are stored together with the data files: part of the index is the data itself, and the other part is stored in the metadata structure of the file, so that the indexes can provide localized access together with HDFS.
(4) Column groups: the structure is columnar storage on the whole. Compared with row storage, however, columnar storage incurs a high data reconstruction cost when serving detail-data queries. To improve detail-data query performance, column-group storage is therefore supported: the user can store as a column group certain fields that are not often used as filter conditions but need to be returned in the result set, and after encoding these fields are stored in row-store form to improve query performance.
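The per-column min-max index and the in-column inverted index of item (3) can be sketched as follows. This is a simplified illustration under stated assumptions: the `Block` layout, the function names `build_block` and `find_equal`, and the sample values are hypothetical, and the cross-file and in-file multidimensional indexes are not modeled.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Block:
    """Hypothetical column block: values plus index metadata stored alongside."""
    values: List[int]
    min_value: int
    max_value: int
    inverted: Dict[int, List[int]]  # value -> positions inside this block


def build_block(values: List[int]) -> Block:
    inverted: Dict[int, List[int]] = {}
    for pos, value in enumerate(values):
        inverted.setdefault(value, []).append(pos)
    return Block(values, min(values), max(values), inverted)


def find_equal(blocks: List[Block], target: int) -> List[Tuple[int, int]]:
    """Return (block id, position) pairs whose value equals target."""
    hits: List[Tuple[int, int]] = []
    for block_id, block in enumerate(blocks):
        # Min-max index: skip blocks whose range cannot contain the target.
        if target < block.min_value or target > block.max_value:
            continue
        # Inverted index: jump to matching positions without scanning values.
        for pos in block.inverted.get(target, []):
            hits.append((block_id, pos))
    return hits


# Assumed sample data for one column, split into three blocks.
blocks = [
    build_block([1, 3, 3, 5]),
    build_block([10, 12, 12, 15]),
    build_block([20, 21, 25, 25]),
]
candidates = [i for i, b in enumerate(blocks) if b.min_value <= 12 <= b.max_value]
hits = find_equal(blocks, 12)
print(candidates, hits)
```

Only the second block survives the min-max check for the value 12, and the inverted index then yields the two matching positions directly.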
The technical means disclosed in the scheme of the present invention are not limited to those disclosed above, and also include technical schemes formed by equivalent substitution of the above technical features. Matters not covered by the present invention belong to the common knowledge of those skilled in the art.
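As a further illustration of the column group in item (4) of Embodiment 1, the sketch below keeps a filter field in columnar form and stores the fields that are only returned in results together, row by row. The field names, sample records, and the `query` function are assumptions for illustration, and the encoding step mentioned in the text is omitted for brevity.

```python
from typing import Dict, List, Tuple

# Assumed sample records; field names are illustrative only.
records = [
    {"city": "Nanjing", "name": "Li", "phone": "111"},
    {"city": "Shanghai", "name": "Wang", "phone": "222"},
    {"city": "Nanjing", "name": "Zhao", "phone": "333"},
]

FILTER_COLUMNS = ["city"]          # often used as filter conditions
GROUP_COLUMNS = ["name", "phone"]  # rarely filtered, but returned in results

# Filter fields are stored column by column.
columns: Dict[str, List[str]] = {
    c: [r[c] for r in records] for c in FILTER_COLUMNS
}

# Column-group fields are stored together, one tuple per row.
column_group: List[Tuple[str, ...]] = [
    tuple(r[c] for c in GROUP_COLUMNS) for r in records
]


def query(city: str) -> List[Tuple[str, ...]]:
    """Filter on the columnar field, then fetch whole tuples from the group."""
    matches = [i for i, v in enumerate(columns["city"]) if v == city]
    # One lookup per matching row returns all grouped fields at once,
    # instead of reassembling the row from several separate columns.
    return [column_group[i] for i in matches]


print(query("Nanjing"))
```

The design choice is a trade-off: grouped fields lose per-column scan efficiency, which is acceptable when they are seldom used as filters, and in exchange detail queries avoid the cost of stitching rows back together.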

Claims (1)

1. A method for creating an index table in a big data system, characterized in that the method comprises:
(1) Data-dictionary-based metadata storage: dictionary encoding is used to speed up computation; it allows the processing/query engine to operate directly on the encoded data without converting it, and data is converted into a user-readable form only when results are returned to the user;
(2) Multidimensional data clustering: data is reorganized along multiple dimensions at storage time so that it is "more cohesive in the multidimensional space", which yields a better compression ratio in storage and better data filtering efficiency in computation;
(3) Indexed columnar file structure: multi-level indexes are designed for multiple classes of scenarios and incorporate some search features, including a multidimensional index across files, a multidimensional index within a file, a min-max index for each column, and an inverted index within a column; the indexes are stored together with the data files, where part of the index is the data itself and the other part is stored in the metadata structure of the file;
(4) Column groups: the structure is columnar storage on the whole; the user can store as a column group certain fields that are not often used as filter conditions but need to be returned in the result set, and after encoding these fields are stored in row-store form to improve query performance.
CN201710879944.4A 2017-09-26 2017-09-26 Method for creating an index table in a big data system Pending CN107562946A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710879944.4A CN107562946A (en) 2017-09-26 2017-09-26 Method for creating an index table in a big data system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710879944.4A CN107562946A (en) 2017-09-26 2017-09-26 Method for creating an index table in a big data system

Publications (1)

Publication Number Publication Date
CN107562946A (en) 2018-01-09

Family

ID=60981744

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710879944.4A Pending CN107562946A (en) 2017-09-26 2017-09-26 Method for creating an index table in a big data system

Country Status (1)

Country Link
CN (1) CN107562946A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101866358A (en) * 2010-06-12 2010-10-20 中国科学院计算技术研究所 Multidimensional interval querying method and system thereof
US8495007B2 (en) * 2008-08-28 2013-07-23 Red Hat, Inc. Systems and methods for hierarchical aggregation of multi-dimensional data sources
CN103218404A (en) * 2013-03-20 2013-07-24 华中科技大学 Multi-dimensional metadata management method and system based on association characteristics
CN103366015A (en) * 2013-07-31 2013-10-23 东南大学 OLAP (on-line analytical processing) data storage and query method based on Hadoop
CN104268158A (en) * 2014-09-03 2015-01-07 深圳大学 Structural data distributed index and retrieval method
CN104715039A (en) * 2015-03-23 2015-06-17 星环信息科技(上海)有限公司 Column-based storage and research method and equipment based on hard disk and internal storage

Similar Documents

Publication Publication Date Title
CN104536959B (en) A kind of optimization method of Hadoop accessing small high-volume files
CN103366015B (en) A kind of OLAP data based on Hadoop stores and querying method
Martínez-Prieto et al. Exchange and consumption of huge RDF data
JP6964384B2 (en) Methods, programs, and systems for the automatic discovery of relationships between fields in a mixed heterogeneous data source environment.
US10013440B1 (en) Incremental out-of-place updates for index structures
CN105117502A (en) Search method based on big data
CN103714096A (en) Lucene-based inverted index system construction method and device, and Lucene-based inverted index system data processing method and device
US20150095341A1 (en) System and a method for hierarchical data column storage and efficient query processing
CN104778182A (en) Data import method and system based on HBase (Hadoop Database)
CN103207864A (en) Online novel content similarity comparison method
CN103678550A (en) Mass data real-time query method based on dynamic index structure
CN108319608A (en) The method, apparatus and system of access log storage inquiry
Sarlis et al. Datix: A system for scalable network analytics
CN113779349A (en) Data retrieval system, apparatus, electronic device, and readable storage medium
CN104765767A (en) Knowledge storage algorithm for intelligent learning
Haque et al. Distributed RDF triple store using hbase and hive
CN110781210A (en) Data processing platform for multi-dimensional aggregation real-time query of large-scale data
Ravindra et al. Efficient processing of RDF graph pattern matching on MapReduce platforms
CN107562946A (en) Method for creating an index table in a big data system
Huang et al. Pisa: An index for aggregating big time series data
Bao et al. Query optimization of massive social network data based on hbase
CN115114293A (en) Database index creating method, related device, equipment and storage medium
CN107844546A (en) A kind of file system metadata management system and method
CN103891244B (en) A kind of method and device carrying out data storage and search
Habbal et al. BIND: An indexing strategy for big data processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180109
