CN108241725B - Data heat statistics system and method - Google Patents

Data heat statistics system and method

Info

Publication number
CN108241725B
Authority
CN
China
Prior art keywords
queried
column
data
server
hive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710374717.6A
Other languages
Chinese (zh)
Other versions
CN108241725A (en)
Inventor
吴宏志
韩东亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New H3C Big Data Technologies Co Ltd
Original Assignee
New H3C Big Data Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by New H3C Big Data Technologies Co Ltd
Priority to CN201710374717.6A
Priority to PCT/CN2018/088195 (WO2018214936A1)
Publication of CN108241725A
Application granted
Publication of CN108241725B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data heat statistics system and method. The system includes a client and a server; the client includes a service parsing module and a client interface module, and the server includes a server interface module and a storage module. The service parsing module obtains information on the tables and/or columns queried in a Hive data warehouse, together with the time of each query. The client interface module sends that information and the query times to the server interface module. The server interface module receives them, counts how many times each table or column was queried within the most recent M unit times, maps the table or column to a predefined data access model based on those counts, and determines its data heat according to the heat threshold of that data access model. The storage module records the received table and column information, the query times, and the data heat of each table or column.

Description

A data heat statistics system and method
Technical field
This application relates to the field of database technology, and in particular to a data heat statistics system and method.
Background
As data volumes grow sharply, a single computer can no longer store them, so distributed clusters have attracted wide attention. In a distributed cluster, data can be spread across multiple computers for storage, and computation can be distributed as well. Hadoop is a distributed system infrastructure: users can develop distributed programs without understanding the underlying distributed details, and can make full use of a cluster of inexpensive computers for high-speed computation and storage.
Hive is a data warehouse infrastructure built on Hadoop. It provides a set of tools for data extraction, transformation, and loading (ETL), and is a mechanism for storing, querying, and analyzing large-scale data held in Hadoop. Hive defines a simple SQL-like query language called HQL (HiveQL), which lets users familiar with SQL query the data.
Summary of the invention
The application provides a data heat statistics system and method for gathering data heat statistics on the data in a Hive data warehouse.
Specifically, the application is implemented through the following technical solutions:
In a first aspect, the application provides a data heat statistics system including a client and a server. The client includes a service parsing module and a client interface module, and the server includes a server interface module and a storage module, where:
The service parsing module is configured to obtain information on the tables and/or columns queried in the Hive data warehouse, together with the query time;
The client interface module is configured to send the information on the queried tables and/or columns and the query time obtained by the service parsing module to the server interface module;
The server interface module is configured to receive the information on the queried tables and/or columns and the query time; and, on receiving a statistics instruction or when a statistics period arrives, to count the number of times the queried tables and/or columns were queried within the most recent M unit times, map the queried tables and/or columns to a predefined data access model, and determine their data heat according to the heat threshold of that data access model, where M is an integer greater than 0;
The storage module is configured to record the information on the queried tables and/or columns and the query time received by the server interface module, and also to record the data heat of the queried tables and/or columns.
In a second aspect, the application provides a data heat statistics method applied to a server, the method including:
Receiving information on the tables and/or columns queried in the Hive data warehouse, together with the query time;
On receiving a statistics instruction or when a statistics period arrives, counting the number of times the queried tables and/or columns were queried within the most recent M unit times, and mapping the queried tables and/or columns to a predefined data access model, where M is an integer greater than 0;
Determining the data heat of the queried tables and/or columns according to the heat threshold of the data access model.
As the above technical solutions show, the application counts the number of times tables and/or columns in a Hive data warehouse are queried and derives heat statistics for them, and the statistics can be as fine-grained as the column level.
Brief description of the drawings
Fig. 1 is a diagram of the technical architecture of Hive;
Fig. 2 is an architecture diagram of a data heat statistics system provided by the application;
Fig. 3 is a schematic diagram of a random access model provided by the application;
Fig. 4 is a schematic diagram of an incremental access model provided by the application;
Fig. 5 is a schematic diagram of a decremental access model provided by the application;
Fig. 6 is a schematic diagram of a periodic access model provided by the application;
Fig. 7 is an architecture diagram of another data heat statistics system provided by the application;
Fig. 8 is a flow chart of a data heat statistics method provided by the application.
Detailed description
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in detail below with reference to the drawings and specific embodiments.
Hive is briefly introduced first.
Fig. 1 shows a common technical architecture of Hive, which mainly includes the following components: the user interfaces, the Driver, and the Hive service (Hive Server).
User interfaces: include the CLI (command line interface), the JDBC (Java Database Connectivity) and ODBC (Open Database Connectivity) interfaces, and the Web GUI (graphical user interface).
Driver: includes the Compiler, the Optimizer, and the Executor. It performs lexical, syntactic, and semantic analysis of the SQL-like query language HQL and finally converts HQL statements into tasks that the underlying execution engine can run.
Hive Server: can be used to develop scalable, cross-language services. There are two versions: Hive Server1 and Hive Server2. Hive Server1 can handle requests from only one client at a time; Hive Server2, its upgraded version, can handle requests from multiple clients at once. In practice, a Hive Server may correspond to a physical device, which may be an x86-based device such as an x86 server.
Hive brings many conveniences to data management and can be used to manage and query structured or unstructured data. However, the functions that existing Hive provides are limited; for example, Hive currently does not support heat statistics on the data in a Hive data warehouse.
To make up for this missing function, the application provides a data heat statistics system. The system can take several architectures; two are described here.
First system architecture:
Fig. 2 shows one architecture of the data heat statistics system, including a client 21 and a server 22. The client 21 may include a hook program module 211, a service parsing module 212, and a client interface module 213; the server 22 may include a server interface module 221, a storage module 222, and a web (web page) module 223.
With this architecture, the client 21 must be deployed on the node device that hosts the Hive Server in the Hive architecture shown in Fig. 1. One feasible approach is to deploy the client 21 in the lib subdirectory of the Hive installation directory. Another is to deploy it in any subdirectory of the Hive installation directory and add its deployment path to Hive's CLASSPATH environment variable, so that the Hive Server node device knows in which directory to find the client 21. This embodiment places no restriction on where the server 22 is deployed, as long as the node device hosting it and the Hive Server node device can reach each other over the network. In practice, the client 21 and the server 22 can both be deployed on the Hive Server node device, which saves one node device.
After the client 21 and the server 22 are deployed, Hive's core configuration file, which may be hive-site.xml, must also be configured. The configuration may consist of setting the full class path of the hook program module 211 included in the client 21 in the hive.exec.post.hooks item of hive-site.xml. The purpose is to make the Hive Server node device aware of the hook program module 211, so that it can call the module at specific moments to trigger collection of the information on the tables and columns queried in the Hive data warehouse and of their query times.
For example, when the Hive Server node device detects an instruction for the Hive data warehouse, it can call the hook program module 211; once called, the hook program module 211 can in turn trigger a call to the service parsing module 212.
The hook program module 211 implements the org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface from Hive's hive-exec package. The Hive Server node device can call the hook program module 211 through this interface at specific moments, for example when it receives an instruction for the Hive data warehouse through the CLI, JDBC/ODBC, Web GUI, or another kind of interface.
The Hive Server node device can call the hook program module 211 asynchronously: it calls the hook program module 211 and, in parallel, processes the Hive data warehouse as the instruction directs.
The service parsing module 212 is configured to parse the above instruction for the Hive data warehouse and, on determining that it is a query instruction, to obtain from it the information on the queried tables and/or columns and to record the time at which they were queried.
Specifically, if the instruction contains a Select statement, the service parsing module 212 can determine that it is a query instruction.
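The query check described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names, the record layout, and the regular expression that pulls a table name after FROM are all assumptions, and real HQL would need a proper parser to extract tables and columns reliably.

```python
import re
from datetime import datetime

# Hypothetical, deliberately naive patterns; real HQL needs a real parser.
_SELECT = re.compile(r"\bselect\b", re.IGNORECASE)
_FROM_TABLE = re.compile(r"\bfrom\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def is_query_instruction(instruction: str) -> bool:
    """Treat an instruction as a query if it contains a Select statement."""
    return _SELECT.search(instruction) is not None


def build_query_record(instruction: str, received_at: datetime):
    """Return the queried tables and the queried time, or None if not a query."""
    if not is_query_instruction(instruction):
        return None
    return {
        "tables": _FROM_TABLE.findall(instruction),
        # The time the instruction was received is recorded as the queried time.
        "queried_time": received_at.isoformat(),
    }
```

A record like this is what a client interface could then hand to the server; non-query instructions such as DROP TABLE simply produce no record.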
The client interface module 213 is configured to send the information on the queried tables and/or columns and the query time obtained by the service parsing module 212 to the server interface module 221.
The application does not limit the interface type of the client interface module 213; any interface that can transfer data may be used, such as a REST (Representational State Transfer) interface.
The server interface module 221 is configured to receive the information on the queried tables and/or columns and their query times sent by the client interface module 213; to count the number of times the queried tables and/or columns were queried within the most recent M unit times; to map them to a predefined data access model according to the statistics; and to determine their data heat according to the heat threshold of that data access model. M is an integer greater than 0.
The unit time here may be a year, quarter, month, week, day, hour, minute, second, and so on.
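Counting queries over the most recent M unit times can be sketched as below. The function name, the default unit of one day, and the half-open window [now - M units, now) are assumptions made for illustration; the text does not fix how window boundaries are handled.

```python
from datetime import datetime, timedelta


def counts_per_unit(query_times, now, m, unit=timedelta(days=1)):
    """Count queries in each of the most recent m unit times, oldest first."""
    if m <= 0:
        raise ValueError("M must be an integer greater than 0")
    counts = [0] * m
    window_start = now - m * unit
    for t in query_times:
        if window_start <= t < now:
            # Floor division of two timedeltas gives the bucket index as an int.
            counts[(t - window_start) // unit] += 1
    return counts
```

The per-unit list keeps the shape of the access curve, which the later model mapping needs; the total for the window is just `sum(counts)`.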
The storage module 222 is configured to record the information on the queried tables and/or columns and the query time received by the server interface module 221, and also to record the data heat of the queried tables and/or columns.
In the application, the storage module 222 can store the information on the queried tables and/or columns, the query times, and the data heat of the tables and/or columns in any database, such as SQL Server.
The web module 223 is configured to request the data heat of tables and/or columns from the server interface module 221 and display it, and also to send the above statistics instruction to the server interface module 221.
Correspondingly, to work with the web module 223, the server interface module 221 is also configured, on receiving a request from the web module 223, to fetch the data heat of the requested tables and/or columns from the storage module 222 and return it to the web module 223.
In the application, the data heat statistics for tables and/or columns in the Hive data warehouse are performed mainly by the server interface module 221 of the server 22.
Data access models can be predefined in the application, including a random access model, an incremental access model, a decremental access model, and a periodic access model. Figs. 3, 4, 5, and 6 illustrate the characteristics of each model; in the figures, the horizontal axis represents time and the vertical axis represents the number of queries.
When the data of a table or column is first stored in the Hive data warehouse, there is not yet enough access history to map it to a specific data access model. So when tables and/or columns are added to the Hive data warehouse, the server interface module 221 can obtain their information and classify their data as hot data. Hot data is data that is used frequently; correspondingly, cold data is data that is almost never used.
As time passes, for example once M unit times have elapsed since a table or column was stored in the Hive data warehouse, the server interface module 221 can, on receiving a statistics instruction or when the statistics period arrives, run data heat statistics on every table or column that has been stored for more than M unit times. For each such table or column, the server interface module 221 maps it to a data access model based on the number of times it was queried within the most recent M unit times, and then determines its data heat according to the heat threshold of the model it was mapped to.
In particular, if on receiving a statistics instruction or when the statistics period arrives the server interface module 221 finds tables and/or columns in the Hive data warehouse that were not queried at all within the most recent M unit times, it can classify them directly as cold data without performing the data access model mapping.
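The two shortcuts described above (newly added data is hot, data with no queries in the window is cold) can be sketched as a pre-check that runs before any model mapping. The function name, its parameters, and the use of `None` to signal "mapping still required" are assumptions for illustration.

```python
def initial_heat(stored_units: int, m: int, query_count: int):
    """Return "hot", "cold", or None when data access model mapping is needed.

    stored_units: how many unit times the table or column has been stored.
    query_count:  times it was queried within the most recent m unit times.
    """
    if stored_units <= m:
        return "hot"   # too new to map to a model, so treated as hot data
    if query_count == 0:
        return "cold"  # never queried in the window, so no mapping is needed
    return None        # proceed to model mapping and threshold comparison
```

Only tables or columns for which this returns `None` need to go through the mapping and threshold steps.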
Optionally, when the heat threshold of the data access model includes an absolute threshold, the server interface module 221 may be configured to: count the number of times the queried table or column was queried within a preset time and judge whether that number is less than the absolute threshold; if so, classify the table or column as cold data; otherwise, classify it as hot data.
Optionally, when the heat threshold of the data access model includes a relative threshold, the server interface module 221 may be configured as follows. For a table: count, within the preset time, the number of times the queried table was queried and the number of times all tables were queried, compute the ratio of the former to the latter, and judge whether the ratio is less than the relative threshold; if so, classify the table as cold data; otherwise, classify it as hot data. For a column: count, within the preset time, the number of times the queried column was queried and the number of times all columns in the table it belongs to were queried, compute the ratio of the former to the latter, and judge whether the ratio is less than the relative threshold; if so, classify the column as cold data; otherwise, classify it as hot data.
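The two threshold rules can be written as a pair of small functions. This is a sketch under stated assumptions: the function names are hypothetical, and treating a zero overall count as cold is a choice made here, since the text does not say what happens when nothing at all was queried.

```python
def heat_by_absolute_threshold(query_count: float, absolute_threshold: float) -> str:
    """Cold if the query count is below the absolute threshold, otherwise hot."""
    return "cold" if query_count < absolute_threshold else "hot"


def heat_by_relative_threshold(query_count: float, total_count: float,
                               relative_threshold: float) -> str:
    """Cold if this item's share of all queries is below the relative threshold.

    For a table, total_count covers all tables; for a column, it covers all
    columns of the table the column belongs to.
    """
    if total_count <= 0:
        return "cold"  # assumption: with no queries at all, nothing is hot
    ratio = query_count / total_count
    return "cold" if ratio < relative_threshold else "hot"
```

Note that a count exactly equal to the threshold is classified as hot, because the rule in the text tests "less than".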
When the data access model that a table or column is mapped to is the random access model, the incremental access model, or the decremental access model, the preset time above is the most recent N unit times, where N is an integer greater than 0. When it is the periodic access model, the preset time is the most recent period. Note that N here may be the same as or different from M above; there is no strict size relationship between them.
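How the preset time depends on the mapped model can be sketched as a window selection over per-unit counts. The model labels, the function name, and the `period_length` parameter (the number of unit times in one period) are assumptions; the text does not say how the period length of a periodic model is obtained.

```python
def counts_in_preset_time(counts, model, n=None, period_length=None):
    """Select the trailing per-unit counts that form the preset time.

    counts: per-unit query counts, oldest first.
    model:  "random", "incremental", "decremental", or "periodic".
    """
    if model == "periodic":
        size = period_length  # the most recent period
    elif model in ("random", "incremental", "decremental"):
        size = n              # the most recent N unit times
    else:
        raise ValueError(f"unknown data access model: {model}")
    if not isinstance(size, int) or size <= 0:
        raise ValueError("window size must be an integer greater than 0")
    return counts[-size:]
```

The selected counts can then be summed or averaged before being compared with the model's threshold.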
The data heat statistics obtained for tables and columns can serve many purposes, a common one being data lifecycle management. The application lists only the following two uses:
For example, after the server interface module 221 obtains the data heat of a table, the storage module 222 may further be configured, according to the data heat of the table, to store tables that are hot data on better-performing storage devices, and to delete tables that are cold data or store them on poorer-performing storage devices.
As another example, after the server interface module 221 obtains the data heat of columns, the storage module 222 may further be configured, according to the data heat of the columns, to store the hot-data columns and the cold-data columns of the same table in separate files.
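The column-splitting use can be illustrated with a small grouping step. The function name and the "hot_file" and "cold_file" keys are hypothetical; how the files are actually written is outside what the text describes.

```python
def split_columns_by_heat(column_heat: dict) -> dict:
    """Group the columns of one table by data heat, one group per target file."""
    return {
        "hot_file": [c for c, heat in column_heat.items() if heat == "hot"],
        "cold_file": [c for c, heat in column_heat.items() if heat == "cold"],
    }
```

For instance, a table whose `id` and `price` columns are hot and whose `note` column is cold would yield two groups that a storage layer could write to different files.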
This completes the description of the system shown in Fig. 2.
Second system architecture:
Fig. 7 shows another architecture of the data heat statistics system provided by the application, including a client 21 and a server 22. The client 21 includes a service parsing module 212 and a client interface module 213; the server 22 includes a server interface module 221, a storage module 222, and a web module 223.
The system shown in Fig. 7 differs from the system shown in Fig. 2 mainly as follows:
1) The client 21 in the system shown in Fig. 7 does not include the hook program module 211.
2) The way of obtaining the information on the tables and/or columns queried in the Hive data warehouse and their query times differs slightly. In this embodiment, because there is no hook program module 211, the service parsing module 212 obtains the information on the queried tables and/or columns and their query times mainly by parsing the Hive log saved on the Hive Server node device. The Hive log can be generated and saved by the Hive Server node device.
3) The deployment location of the client 21. In this embodiment, the client 21 does not have to be deployed on the Hive Server node device, as long as it can obtain the Hive log saved there. Correspondingly, when the client 21 is not deployed on the Hive Server node device, Hive's core configuration file does not need to be configured.
Apart from these three points, the other modules, such as the client interface module 213, the server interface module 221, the storage module 222, and the web module 223, function the same as the corresponding modules in the system shown in Fig. 2 and are not described again here.
This completes the description of the system shown in Fig. 7.
The detailed process by which the server performs data heat statistics is described below with reference to Fig. 8. The process may include the following steps:
Step 801: the server receives the information on the tables and/or columns queried in the Hive data warehouse, together with the query time.
As described above, this information and the query time are obtained by the client and sent to the server. The client can parse the information on the queried tables and/or columns from a query instruction for the Hive data warehouse and record the system time at which the query instruction was received as the query time of those tables and/or columns; alternatively, the client can parse the information on the queried tables and/or columns and the query time from the Hive log saved on the Hive Server node device.
Step 802: on receiving a statistics instruction or when the statistics period arrives, the server counts the number of times the queried tables and/or columns were queried within the most recent M unit times and maps them to a predefined data access model, where M is an integer greater than 0.
The unit time here may be a year, quarter, month, week, day, hour, minute, second, and so on. Taking one day as the unit, for example, the server counts the total number of times a table or column in the Hive data warehouse was queried within M days.
In practice, a table and/or column cannot yet be mapped to a specific data access model during the initial period after it is stored in the Hive data warehouse. So when tables and/or columns are added to the Hive data warehouse, the server can obtain their information and classify them directly as hot data. Once a newly added table and/or column has been stored for more than M unit times, data heat statistics are run on it as described in steps 802 and 803.
In particular, on receiving a statistics instruction or when the statistics period arrives, there may be tables and/or columns in the Hive data warehouse that were not queried within the most recent M unit times. The server can classify such tables and/or columns directly as cold data without performing data access model mapping.
The data access models predefined in the application include the random access model, the incremental access model, the decremental access model, and the periodic access model. Figs. 3, 4, 5, and 6 show examples of each model; in the figures, the horizontal axis represents time and the vertical axis represents the number of queries.
Here the server can use an existing classification algorithm, such as a Bayesian algorithm or a neural network classification algorithm, to map each queried table and/or column to the corresponding data access model according to the number of times it was queried within the most recent M unit times. As long as a table and/or column in the Hive data warehouse was queried within the most recent M unit times, it can always be mapped to one of the random, incremental, decremental, and periodic access models.
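To make the mapping step concrete, here is a deliberately simple rule-based stand-in. It is not the Bayesian or neural network classifier mentioned in the text: it labels a series periodic if it repeats exactly with some period, and otherwise uses the least-squares slope relative to the mean to tell incremental, decremental, and random apart. The function names, the model labels, and the `trend` cut-off of 0.05 are all assumptions chosen for illustration.

```python
def _slope(counts):
    """Least-squares slope of the counts against their index 0..n-1."""
    n = len(counts)
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(counts))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def _repeats_with_period(counts):
    """True if the series repeats exactly with some period p, 2 <= p <= n // 2."""
    n = len(counts)
    if max(counts) == min(counts):
        return False  # a flat series shows no periodic pattern
    return any(
        all(counts[i] == counts[i + p] for i in range(n - p))
        for p in range(2, n // 2 + 1)
    )


def map_to_access_model(counts, trend=0.05):
    """Map per-unit query counts (oldest first) to a data access model label."""
    if len(counts) < 4 or sum(counts) == 0:
        raise ValueError("need at least 4 unit times and at least one query")
    if _repeats_with_period(counts):
        return "periodic"
    relative_slope = _slope(counts) / (sum(counts) / len(counts))
    if relative_slope > trend:
        return "incremental"
    if relative_slope < -trend:
        return "decremental"
    return "random"
```

With the daily counts quoted for Fig. 3 and Fig. 4 elsewhere in the description, this sketch yields "random" and "incremental" respectively, which matches the figures' labels; real access curves are noisier and would call for a trained classifier as the text suggests.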
Step 803: the server determines the data heat of the queried tables and/or columns according to the heat threshold of the data access model they were mapped to.
Different data access models have different ways of determining data heat.
For example, for a table or column mapped to the random access model, either of the following two approaches can be used to judge whether it is cold data or hot data:
Approach one: absolute threshold.
For a table, count the number of times the table was queried within the most recent N unit times (average count or total count) and judge whether that number is less than the absolute threshold specified by the random access model. If so, the table is cold data; otherwise, it is hot data.
For a column, count the number of times the column was queried within the most recent N unit times (average count or total count) and judge whether that number is less than the absolute threshold specified by the random access model. If so, the column is cold data; otherwise, it is hot data.
Here N is specified by the data access model that was mapped to, namely the random access model, and is an integer greater than 0.
Take Fig. 3 as an example. Assume the unit time is a day, the N specified by the random access model is 7, and the absolute threshold is 10 (average count). The data shown in Fig. 3 was queried 7, 9, 2, 3, 7, 8, and 7 times on each of the most recent 7 days, so its average daily query count over those 7 days is 43/7 ≈ 6.14. Because 6.14 is less than the 10 specified by the random access model, the data is cold data.
Approach two: relative threshold.
For a table, count the number of times the table was queried within the most recent N unit times (average count or total count) and the number of times all tables in the Hive data warehouse were queried (average count or total count), and compute the ratio of the former to the latter. If the ratio is less than the relative threshold specified by the random access model, the table is cold data; otherwise, it is hot data.
For a column, count the number of times the column was queried within the most recent N unit times (average count or total count) and the number of times all columns in the table it belongs to were queried (average count or total count), and compute the ratio of the former to the latter. If the ratio is less than the relative threshold specified by the random access model, the column is cold data; otherwise, it is hot data.
Again take Fig. 3 as an example. Assume the unit time is a day, the N specified by the random access model is 7, the relative threshold is 10%, and the average daily query count of all data over the most recent 7 days is 100. The data's average query count over the most recent 7 days then accounts for 6.14/100 = 6.14% of the overall figure. Because 6.14% is less than the 10% specified by the random access model, the data is cold data.
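The Fig. 3 arithmetic can be reproduced in a few lines of Python. The variable names are illustrative; the counts, N = 7, the absolute threshold of 10, the overall average of 100, and the relative threshold of 10% come from the example in the text.

```python
# Daily query counts for the most recent 7 days, as given for Fig. 3.
daily_counts = [7, 9, 2, 3, 7, 8, 7]

# Approach one: compare the average count with the absolute threshold.
average = sum(daily_counts) / len(daily_counts)  # 43 / 7
absolute_threshold = 10
heat_absolute = "cold" if average < absolute_threshold else "hot"

# Approach two: compare the share of the overall average with the relative threshold.
overall_average = 100
relative_threshold = 0.10
ratio = average / overall_average
heat_relative = "cold" if ratio < relative_threshold else "hot"

print(round(average, 2), heat_absolute, f"{ratio:.2%}", heat_relative)
```

Both approaches classify the Fig. 3 data as cold, in line with the two worked examples.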
For being mapped to the table or column of incremental Access Model or Access Model of successively decreasing, there can also be following two mode to sentence Break and the table or be classified as cold data or dsc data:
Mode one: absolute threshold mode.
For table, can count within the nearest N number of unit time table is queried number (average time or total time Number), judge whether the number is less than incremental Access Model or the absolute threshold as defined in Access Model that successively decreases, if then determining the table For cold data, otherwise determine that the table is dsc data.
For column, can count within the nearest N number of unit time column is queried number (average time or total time Number), judge whether the number is less than incremental Access Model or the absolute threshold as defined in Access Model that successively decreases, if then determining the column For cold data, otherwise determine that this is classified as dsc data.
Here N is incremented by Access Model or Access Model of successively decreasing regulation, N is by the data access logic component being mapped to Integer greater than 0.
By taking Fig. 4 as an example, it is assumed that using day as the unit time, being incremented by N as defined in Access Model is 7, and absolute threshold is (total 10 times Number).Data shown in Fig. 4 in nearest 7 days it is daily be queried number respectively and be 7 times, 7 times, 8 times, 9 times, 9 times, 10 times, It 10 times, calculates and knows that be always queried number of the data in nearest 7 days is 60 times, 60 times due to being calculated, which are greater than, to be incremented by 10 times as defined in Access Model, therefore the data are dsc data.
Approach 2: relative threshold.
For a table, the number of times the table was queried within the most recent N time units (as an average or a total) and the number of times all tables in the Hive data warehouse were queried (as an average or a total) can be counted, and the ratio of the former to the latter calculated. If the ratio is less than the relative threshold specified by the incremental or decremental access model, the table is determined to be cold data; otherwise, the table is determined to be hot data.
For a column, the number of times the column was queried within the most recent N time units (as an average or a total) and the number of times all columns in the table to which the column belongs were queried (as an average or a total) can be counted, and the ratio of the former to the latter calculated. If the ratio is less than the relative threshold specified by the incremental or decremental access model, the column is determined to be cold data; otherwise, the column is determined to be hot data.
Continuing with the example of Fig. 4, assume that the unit time is one day, the N specified by the incremental access model is 7, the relative threshold is 10%, and the total number of times the overall data was queried in the most recent 7 days is 1000. The total number of times the data was queried in the most recent 7 days therefore accounts for 60/1000 = 6% of that of the overall data. Because the calculated ratio of 6% is less than the 10% specified by the incremental access model, the data is cold data.
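For columns, the relative threshold is applied against all columns of the same table. A minimal sketch of that per-table comparison is shown below; the column names and counts are invented for illustration (chosen so that one column reproduces the 60/1000 = 6% ratio of the example), and treating every column of a never-queried table as cold is an assumption of this sketch.

```python
from typing import Dict


def classify_columns(column_counts: Dict[str, int],
                     relative_threshold: float) -> Dict[str, str]:
    """Classify each column by its share of all column queries in one table."""
    total = sum(column_counts.values())
    if total == 0:
        # Assumption: if no column of the table was queried, all are cold.
        return {name: "cold" for name in column_counts}
    return {
        name: "cold" if count / total < relative_threshold else "hot"
        for name, count in column_counts.items()
    }


# Hypothetical per-column query totals for one table over the window.
counts = {"user_id": 600, "amount": 340, "remark": 60}
print(classify_columns(counts, 0.10))
```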
For a table or column mapped to the periodic access model, either of the following two approaches can likewise be used to determine whether the table or column is cold data or hot data:
Approach 1: absolute threshold.
For a table, the number of times the table was queried within the most recent period (as an average) can be counted and compared with the absolute threshold specified by the periodic access model. If the number is less than the threshold, the table is determined to be cold data; otherwise, the table is determined to be hot data.
For a column, the number of times the column was queried within the most recent period (as an average) can be counted and compared with the absolute threshold specified by the periodic access model. If the number is less than the threshold, the column is determined to be cold data; otherwise, the column is determined to be hot data.
Taking Fig. 6 as an example, assume that the unit time is one day and the absolute threshold specified by the periodic access model is 10 (average count). The period of the data shown in Fig. 6 is 9 days, and the daily query counts in the most recent period are 10, 12, 11, 8, 6, 4, 6, 8, and 11. The average number of times the data was queried in the most recent period is calculated as 70/9 ≈ 7.78. Because the calculated 7.78 is less than the 10 specified by the periodic access model, the data is cold data.
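The averaging step for the periodic access model can be sketched as follows, taking the total of 70 and the 9-day period stated in the text as inputs. The function name `average_per_unit` is a hypothetical name used only in this sketch.

```python
def average_per_unit(total_in_period: float, period_length: int) -> float:
    """Average query count per time unit over one full period."""
    if period_length <= 0:
        raise ValueError("period_length must be an integer greater than 0")
    return total_in_period / period_length


# Total and period length as stated in the Fig. 6 example.
average = average_per_unit(70, 9)
absolute_threshold = 10
print(round(average, 2), "cold" if average < absolute_threshold else "hot")
```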
Approach 2: relative threshold.
For a table, the number of times the table was queried within the most recent period (as an average or a total) and the number of times all tables in the Hive data warehouse were queried (as an average or a total) can be counted, and the ratio of the former to the latter calculated. If the ratio is less than the relative threshold specified by the periodic access model, the table is determined to be cold data; otherwise, the table is determined to be hot data.
For a column, the number of times the column was queried within the most recent period (as an average or a total) and the number of times all columns in the table to which the column belongs were queried (as an average or a total) can be counted, and the ratio of the former to the latter calculated. If the ratio is less than the relative threshold specified by the periodic access model, the column is determined to be cold data; otherwise, the column is determined to be hot data.
Continuing with the example of Fig. 6, assume that the unit time is one day, the relative threshold specified by the periodic access model is 10%, and the total number of times the overall data was queried in the most recent period (that is, within 9 days) is 1000. The total number of times the data was queried in the most recent period therefore accounts for 70/1000 = 7% of that of the overall data. Because the calculated 7% is less than the 10% specified by the periodic access model, the data is cold data.
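The examples differ mainly in the statistics window: the most recent N time units for the random, incremental, and decremental access models, and the most recent period for the periodic access model. A small sketch of that selection follows; the `AccessModel` enum and both function names are assumptions made for illustration.

```python
from enum import Enum
from typing import Optional


class AccessModel(Enum):
    RANDOM = "random"
    INCREMENTAL = "incremental"
    DECREMENTAL = "decremental"
    PERIODIC = "periodic"


def window_length(model: AccessModel,
                  n: Optional[int] = None,
                  period: Optional[int] = None) -> int:
    """Return the statistics window, in time units, for the given model."""
    chosen = period if model is AccessModel.PERIODIC else n
    if chosen is None or chosen <= 0:
        raise ValueError("window length must be an integer greater than 0")
    return chosen


def ratio_is_cold(own_total: float, overall_total: float,
                  relative_threshold: float) -> bool:
    """Cold when the share of the overall total is below the threshold."""
    return own_total / overall_total < relative_threshold


# Fig. 6 example: 9-day period, 70 of 1000 queries, 10% threshold.
print(window_length(AccessModel.PERIODIC, period=9))
print(ratio_is_cold(70, 1000, 0.10))
```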
The data hotness obtained for tables and columns can serve many purposes; a relatively common one is data lifecycle management. This application lists only the following two uses:
1) Based on the data hotness of tables, tables that are hot data can be stored on storage devices with better performance, and tables that are cold data can be deleted or stored on storage devices with poorer performance.
2) Based on the data hotness of columns, the columns of a same table that are hot data and the columns that are cold data can be stored in different files.
In existing implementations, Hive uses a row-oriented storage mechanism: a table may be stored across multiple files, and each file stores the data of all columns. As a result, when a query instruction for a column that is hot data is received, the data of the cold-data columns in the same file is inevitably read as well.
With the separate storage of hot and cold data described in this application, when a query instruction for a column that is hot data is received, the requested data can be obtained from the file dedicated to storing hot-data columns. This reduces the disk I/O consumed by most query operations and improves query efficiency.
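The effect of storing hot and cold columns in different files can be illustrated with a small stand-alone sketch. It uses plain CSV files purely for demonstration and does not reflect Hive's actual file formats; the function names, file names, and sample rows are all assumptions of this sketch.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List, Sequence, Tuple


def split_columns(rows: Sequence[Dict[str, object]],
                  hot_columns: List[str],
                  cold_columns: List[str],
                  out_dir: Path) -> Tuple[Path, Path]:
    """Write hot and cold columns of the same rows into two separate files."""
    hot_path = out_dir / "hot.csv"
    cold_path = out_dir / "cold.csv"
    for path, columns in ((hot_path, hot_columns), (cold_path, cold_columns)):
        with path.open("w", newline="") as handle:
            # extrasaction="ignore" drops the columns not meant for this file.
            writer = csv.DictWriter(handle, fieldnames=columns,
                                    extrasaction="ignore")
            writer.writeheader()
            writer.writerows(rows)
    return hot_path, cold_path


def read_column(path: Path, column: str) -> List[str]:
    """Read one column from a single file without touching the other file."""
    with path.open(newline="") as handle:
        return [row[column] for row in csv.DictReader(handle)]


# Hypothetical table rows; "remark" plays the role of a cold column.
rows = [
    {"user_id": 1, "amount": 9.5, "remark": "gift"},
    {"user_id": 2, "amount": 20.0, "remark": "bulk"},
]

with tempfile.TemporaryDirectory() as tmp:
    hot_file, cold_file = split_columns(
        rows, ["user_id", "amount"], ["remark"], Path(tmp))
    # A query on a hot column only reads the hot file.
    print(read_column(hot_file, "amount"))
```

Both files keep the rows in the same order, so in this sketch a full row can still be reassembled by position when a cold column is occasionally needed.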
In this application, after the data hotness of tables and columns is obtained, the data hotness of a table or column can also be presented, including the number of times the table or column was queried within the most recent M time units, the data access model to which the table or column is mapped, and the hot or cold state of the table or column. Specifically, this information can be presented through a web page.
This completes the description of the process shown in Fig. 8.
As the process shown in Fig. 8 demonstrates, this application implements data hotness statistics for tables and/or columns in a Hive data warehouse, and the granularity of the statistics can be as fine as the column (that is, field) level.
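The counting step behind these statistics, performed on recorded query information at both table and column level, can be sketched as follows. The `QueryRecord` structure, the function name `count_recent`, and the sample table and column names are assumptions made for illustration; the application only states that the queried table and/or column and the query time are recorded.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Iterable, NamedTuple, Tuple


class QueryRecord(NamedTuple):
    table: str
    columns: Tuple[str, ...]   # columns referenced by one query
    queried_at: datetime


def count_recent(records: Iterable[QueryRecord], now: datetime, m: int,
                 unit: timedelta = timedelta(days=1)):
    """Count queries per table and per column in the most recent m units."""
    if m <= 0:
        raise ValueError("m must be an integer greater than 0")
    cutoff = now - m * unit
    table_counts: Counter = Counter()
    column_counts: Counter = Counter()
    for record in records:
        if cutoff < record.queried_at <= now:
            table_counts[record.table] += 1
            for column in record.columns:
                column_counts[(record.table, column)] += 1
    return table_counts, column_counts


now = datetime(2017, 5, 24)
records = [
    QueryRecord("orders", ("user_id", "amount"), datetime(2017, 5, 23)),
    QueryRecord("orders", ("user_id",), datetime(2017, 5, 20)),
    QueryRecord("orders", ("remark",), datetime(2017, 5, 1)),  # outside window
]
tables, columns = count_recent(records, now, m=7)
print(tables["orders"], columns[("orders", "user_id")])
```

A table or column that does not appear in the counters was not queried within the window, which corresponds to the case the application treats as cold data.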
The foregoing describes only preferred embodiments of this application and is not intended to limit this application. Any modification, equivalent replacement, or improvement made within the spirit and principles of this application shall fall within the scope of protection of this application.

Claims (16)

1. A data hotness statistics system, comprising a client and a server, wherein the client comprises a service parsing module and a client interface module, and the server comprises a server interface module and a storage module, wherein:
the service parsing module is configured to obtain information about a table and/or column queried in a Hive data warehouse and the time at which it was queried;
the client interface module is configured to send the information about the queried table and/or column and the query time obtained by the service parsing module to the server interface module;
the server interface module is configured to receive the information about the queried table and/or column and the query time; and is further configured to, when a statistics instruction is received or a statistics period arrives, count the number of times the queried table and/or column was queried within the most recent M time units, map the queried table and/or column to a predefined data access model, and determine the data hotness of the queried table and/or column according to the hotness threshold of the data access model; wherein M is an integer greater than 0, and the data access model comprises a random access model, an incremental access model, a decremental access model, and a periodic access model;
the storage module is configured to record the information about the queried table and/or column and the query time received by the server interface module, and is further configured to record the data hotness of the queried table and/or column.
2. The system according to claim 1, wherein:
the server interface module is further configured to, when a table and/or column is newly added to the Hive data warehouse, obtain information about the newly added table and/or column and determine that the newly added table and/or column is hot data;
the storage module is further configured to record the information about the newly added table and/or column and the data hotness of the newly added table and/or column.
3. The system according to claim 1, wherein:
the server interface module is further configured to, when a statistics instruction is received or a statistics period arrives, determine that a table and/or column in the Hive data warehouse that was not queried within the most recent M time units is cold data;
the storage module is further configured to record the data hotness of the table and/or column that was not queried.
4. The system according to any one of claims 1 to 3, wherein:
the storage module is further configured to, according to the data hotness of columns, store the columns of a same table that are hot data and the columns that are cold data in different files.
5. The system according to claim 1, wherein:
the service parsing module is further configured to parse a Hive log stored on a Hive service node device to obtain the information about the queried table and/or column and the query time;
or,
the client further comprises a hook program module, which is invoked when the Hive service node device detects an instruction directed at the Hive data warehouse and is configured to trigger a call to the service parsing module;
the service parsing module is further configured to parse the instruction and, upon determining that the instruction is a query instruction, obtain the information about the queried table and/or column from the query instruction and record the time at which the table and/or column was queried.
6. The system according to claim 1, wherein, when the hotness threshold of the data access model comprises an absolute threshold, the server interface module is further configured to:
count the number of times the queried table or column was queried within a preset time, and determine whether that number is less than the absolute threshold; if so, determine that the queried table or column is cold data; otherwise, determine that the queried table or column is hot data.
7. The system according to claim 1, wherein, when the hotness threshold of the data access model comprises a relative threshold, the server interface module is further configured to:
count the number of times the queried table was queried and the number of times all tables were queried within a preset time, calculate the ratio of the number of times the queried table was queried to the number of times all tables were queried, and determine whether the ratio is less than the relative threshold; if so, determine that the queried table is cold data; otherwise, determine that the queried table is hot data;
count the number of times the queried column was queried and the number of times all columns in the table to which the column belongs were queried within the preset time, calculate the ratio of the number of times the queried column was queried to the number of times all columns in that table were queried, and determine whether the ratio is less than the relative threshold; if so, determine that the queried column is cold data; otherwise, determine that the queried column is hot data.
8. The system according to claim 6 or 7, wherein:
when the data access model is the random access model, the incremental access model, or the decremental access model, the preset time is the most recent N time units, N being an integer greater than 0;
when the data access model is the periodic access model, the preset time is the most recent period.
9. The system according to any one of claims 1 to 3, wherein:
the server further comprises a web module configured to request the data hotness of tables and/or columns from the server interface module and display it, and further configured to send the statistics instruction to the server interface module;
the server interface module is further configured to, in response to the request from the web module, obtain the data hotness of the tables and/or columns from the storage module and return it to the web module.
10. A data hotness statistics method, applied to a server, the method comprising:
receiving information about a table and/or column queried in a Hive data warehouse and the time at which it was queried;
when a statistics instruction is received or a statistics period arrives, counting the number of times the queried table and/or column was queried within the most recent M time units, and mapping the queried table and/or column to a predefined data access model; wherein M is an integer greater than 0, and the data access model comprises a random access model, an incremental access model, a decremental access model, and a periodic access model;
determining the data hotness of the queried table and/or column according to the hotness threshold of the data access model.
11. The method according to claim 10, further comprising:
when a table and/or column is newly added to the Hive data warehouse, obtaining information about the newly added table and/or column, and determining that the newly added table and/or column is hot data.
12. The method according to claim 10, further comprising:
when a statistics instruction is received or a statistics period arrives, determining that a table and/or column in the Hive data warehouse that was not queried within the most recent M time units is cold data.
13. The method according to any one of claims 10 to 12, further comprising:
according to the data hotness of columns, storing the columns of a same table that are hot data and the columns that are cold data in different files.
14. The method according to claim 10, wherein the hotness threshold of the data access model comprises an absolute threshold;
determining the data hotness of the queried table or column according to the hotness threshold of the data access model comprises:
counting the number of times the queried table or column was queried within a preset time, and determining whether that number is less than the absolute threshold; if so, determining that the queried table or column is cold data; otherwise, determining that the queried table or column is hot data.
15. The method according to claim 10, wherein the hotness threshold of the data access model comprises a relative threshold;
determining the data hotness of the queried table or column according to the hotness threshold of the data access model comprises:
counting the number of times the queried table was queried and the number of times all tables were queried within a preset time, calculating the ratio of the number of times the queried table was queried to the number of times all tables were queried, and determining whether the ratio is less than the relative threshold; if so, determining that the queried table is cold data; otherwise, determining that the queried table is hot data;
counting the number of times the queried column was queried and the number of times all columns in the table to which the column belongs were queried within the preset time, calculating the ratio of the number of times the queried column was queried to the number of times all columns in that table were queried, and determining whether the ratio is less than the relative threshold; if so, determining that the queried column is cold data; otherwise, determining that the queried column is hot data.
16. The method according to claim 14 or 15, wherein:
when the data access model is the random access model, the incremental access model, or the decremental access model, the preset time is the most recent N time units, N being an integer greater than 0;
when the data access model is the periodic access model, the preset time is the most recent period.
CN201710374717.6A 2017-05-24 2017-05-24 A kind of data hot statistics system and method Active CN108241725B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201710374717.6A CN108241725B (en) 2017-05-24 2017-05-24 A kind of data hot statistics system and method
PCT/CN2018/088195 WO2018214936A1 (en) 2017-05-24 2018-05-24 Data popularity statistics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710374717.6A CN108241725B (en) 2017-05-24 2017-05-24 A kind of data hot statistics system and method

Publications (2)

Publication Number Publication Date
CN108241725A CN108241725A (en) 2018-07-03
CN108241725B true CN108241725B (en) 2019-07-05

Family

ID=62703160

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710374717.6A Active CN108241725B (en) 2017-05-24 2017-05-24 A kind of data hot statistics system and method

Country Status (2)

Country Link
CN (1) CN108241725B (en)
WO (1) WO2018214936A1 (en)

Also Published As

Publication number Publication date
CN108241725A (en) 2018-07-03
WO2018214936A1 (en) 2018-11-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant