CN112015741A - Method and device for storing massive data in different databases and tables - Google Patents

Method and device for storing massive data in different databases and tables Download PDF

Info

Publication number
CN112015741A
CN112015741A CN202011118667.3A CN202011118667A CN112015741A CN 112015741 A CN112015741 A CN 112015741A CN 202011118667 A CN202011118667 A CN 202011118667A CN 112015741 A CN112015741 A CN 112015741A
Authority
CN
China
Prior art keywords
data
database
index
value
tables
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011118667.3A
Other languages
Chinese (zh)
Inventor
张宁
王�义
聂磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Wuyi Yuntong Network Technology Co ltd
Original Assignee
Wuhan Wuyi Yuntong Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Wuyi Yuntong Network Technology Co ltd filed Critical Wuhan Wuyi Yuntong Network Technology Co ltd
Priority to CN202011118667.3A priority Critical patent/CN112015741A/en
Publication of CN112015741A publication Critical patent/CN112015741A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2282Tablespace storage structures; Management thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Fuzzy Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for storing massive data in a database and a table, wherein the method comprises the following steps: determining the total number N of the data table and the total number M of the database; n is more than or equal to M, and each database contains N/M data tables; naming each database according to the database index, so that the database names of the M databases are increased progressively according to the database index sequence; naming each data table according to the table index, so that the table names of the N data tables are increased progressively according to the table index sequence; and selecting a splitting key field from the data table, determining a corresponding library index and a corresponding table index for each data based on the value of the splitting key field, and further storing the data into the corresponding database and the corresponding data table. The massive data are stored in the relational database in a structured mode through database and table dividing operation, the storage problem of the massive data is effectively solved, the maintenance cost of the massive data is reduced on the whole, the processing difficulty of the business data is reduced, and the utilization rate of database hardware is improved.

Description

Method and device for storing massive data in different databases and tables
Technical Field
The invention relates to the technical field of information processing, in particular to a method and a device for storing massive data in a database and a table.
Background
At present, in a plurality of application fields such as commerce, finance and management, a relational database is generally used as a data storage medium when data processing is carried out by internet enterprises; for example, the supply chain financial platform and the e-commerce platform of the bulk B2B all use a relational database as a storage medium for data processing. The relational database is a database composed of a plurality of two-dimensional row-column tables which can be connected with each other, and currently, the mainstream relational databases include Oracle, DB2, Microsoft SQL Server, Microsoft Access, MySQL and the like. At present, most enterprises adopt relational databases such as MySQL and the like to store, process and analyze data information, but with the continuous development and growth of the enterprises, data accompanying businesses also increase day by day, so that the existing mainstream relational databases cannot meet the requirement of storing, analyzing and processing mass data information. For example, when a certain amount of data is accumulated in a data table, a large amount of hardware resources are consumed for querying the data table, the query time is increased, and if the combined query is performed at this time, even the whole database may be crashed or crashed.
In order to avoid the phenomenon that the long-term development of a main business is hindered due to the reduction of data query and analysis capacity caused by overlarge data, enterprises often need to invest huge cost to enhance the maintenance force of a database, and a plurality of ways are adopted to archive and optimize the mass data. However, this approach is not only inefficient, but also complicates the processing of the traffic data. Therefore, how to design a storage device capable of supporting mass data to meet the storage requirements of various businesses of an enterprise on the mass data, and at the same time, the storage device can help the enterprise reduce operation and maintenance cost, reduce the complexity of business data processing, and improve the hardware utilization rate of a database server, which is a problem to be solved urgently in the technical field.
Disclosure of Invention
The technical problems to be solved by the invention are as follows:
at present, a mainstream relational database gradually cannot meet the requirements of storage, analysis and processing of mass data information, so that enterprises usually need to invest huge cost to enhance the maintenance of the database, and various ways are adopted to archive and optimize the mass data, so that not only is the efficiency low, but also the processing of business data becomes more complicated and complicated.
The invention solves the technical problems through the following technical scheme:
in a first aspect, the present invention provides a method for storing mass data in a database and table, including:
determining the total number N of the data table and the total number M of the database; n is more than or equal to M, and each database contains N/M data tables;
naming each database according to the database index, so that the database names of the M databases are increased progressively according to the database index sequence; naming each data table according to the table index, so that the table names of the N data tables are increased progressively according to the table index sequence;
and selecting a splitting key field from the data table, determining a corresponding library index and a corresponding table index for each data based on the value of the splitting key field, and further storing the data into the corresponding database and the corresponding data table.
Preferably, for each data to be stored, the corresponding library index and table index are determined based on the value of the split key field, and then the data is stored in the corresponding database and data table, specifically:
determining a value X of a split key field in the data, and performing hash calculation on the value X of the split key field by adopting a hash algorithm to obtain a corresponding hash value H;
calculating a database index by a remainder calculation method based on the hash value H and the total number M of the database; calculating the table index by a remainder calculation method based on the hash value H and the total number N of the data table;
and determining the database and the data table where the data is located according to the calculated library index and table index, and further storing the data into the corresponding data table of the corresponding database.
Preferably, the library indexes of the M databases are gradually increased from 0, and the table indexes of the N data tables are gradually increased from 0;
calculating a database index by a remainder calculation method based on the hash value H and the total number M of the database, specifically: bin index = (H/P)% M; wherein, P = N/M,% represents remainder;
the table index is calculated by a remainder calculation method based on the hash value H and the total number N of the data table, and specifically comprises the following steps: table index = (H% N).
Preferably, the hash algorithm employs the murmurur 3 algorithm or the murmurur 2 algorithm.
Preferably, when a batch query of data is required, the method further comprises:
clustering values of keywords in the query statement so as to summarize values with the same database and data table coordinates together and finally obtain z groups;
splitting the query statement into z segmented query statements according to the clustering result, wherein each segmented query statement corresponds to one data table in one database;
and respectively issuing the z segmented query sentences to a data table in a corresponding database for execution.
Preferably, the value of the keyword in the query statement specifically adopts a hash value H, and the clustering process on the value of the keyword in the query statement specifically includes:
and determining a library index and a table index corresponding to each hash value in the query statement, and summarizing the hash values with the same library index and table index together to form a group.
Preferably, the value of M is the power M of 2, the value of N is the power N of 2, and N ≧ M.
Preferably, the total number N of the data tables and the total number M of the databases are configured according to the concurrency, data volume and throughput of the enterprise business.
Preferably, the naming method of the database is specifically as follows: the M databases are named in a mode of library name plus _; wherein, the library index is gradually increased from 0;
the naming method of the data table specifically comprises the following steps: the N data tables are named in a mode of ' table name ' + ' _ ' and ' table index ' + '; where the table index is incremented one by one starting from 0.
In a second aspect, the present invention provides a device for storing mass data in different banks and in different tables, including at least one processor and a memory, where the at least one processor and the memory are connected through a data bus, and the memory stores instructions executable by the at least one processor, and the instructions are used to complete the method for storing mass data in different banks and in different tables according to the first aspect after being executed by the processor.
Compared with the prior art, the invention has the beneficial effects that:
the data storage scheme provided by the invention realizes the database partitioning and table partitioning operation of the relational database based on the independent database and table partitioning algorithm, particularly disperses N data tables in M databases, selects a fixed splitting key field for the data tables, and positions each data to the specific database and the specific data table for storage based on the value of the splitting key field, thereby storing the mass data in the relational database in a structured manner and effectively solving the storage problem of the mass data. In fact, the traditional relational database is efficiently vertically split and horizontally split, so that the maintenance cost of an enterprise on mass data is reduced on the whole, the processing difficulty of business data is reduced, the utilization rate of database hardware is greatly improved, and the storage requirements of various businesses of the enterprise on mass data are met.
[ description of the drawings ]
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 is a flowchart of a method for storing mass data in different banks and tables according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a storage structure of a sub-database and sub-table of mass data according to an embodiment of the present invention;
fig. 3 is a flowchart illustrating a specific implementation of a database-dividing and table-dividing algorithm according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a core code for performing a hash calculation using the murmurur 3 algorithm according to an embodiment of the present invention;
fig. 5 is a schematic core code diagram of a banking algorithm according to an embodiment of the present invention;
FIG. 6 is a core code diagram of another library partitioning algorithm according to an embodiment of the present invention;
FIG. 7 is a core code diagram of a table-splitting algorithm according to an embodiment of the present invention;
FIG. 8 is a core code diagram of another table-splitting algorithm according to an embodiment of the present invention;
FIG. 9 is a flowchart of a batch query method for data according to an embodiment of the present invention;
fig. 10 is a schematic diagram of an SQL query statement according to an embodiment of the present invention;
fig. 11 is a schematic diagram of splitting an SQL query statement according to an embodiment of the present invention;
fig. 12 is an architecture diagram of a device for storing multiple databases and multiple tables according to an embodiment of the present invention.
[ detailed description ] embodiments
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other. The invention will be described in detail below with reference to the figures and examples.
Example 1:
in order to solve the problem that the traditional scheme cannot effectively store the mass data, the embodiment of the invention provides a database-partitioning and table-partitioning storage method for the mass data based on a self-developed database-partitioning algorithm and a table-partitioning algorithm, and the method can be used for efficiently vertically splitting and horizontally splitting the traditional relational database.
In the technical field of databases, the storage and processing capabilities of single tables are limited, and once the storage and processing capabilities of single tables exceed the upper limit, the storage and processing capabilities of single tables are reduced, and different relational databases have different storage and processing capabilities of single tables, so that special processing needs to be performed according to different databases, but the storage methods of sub-databases and sub-tables are the same. Therefore, in order to more clearly illustrate the technical solution of the present invention, the most widely used MySQL database of internet enterprises will be taken as an example, but not limiting the present invention.
As shown in fig. 1, the storage method provided in the embodiment of the present invention mainly includes the following steps:
step 10, determining the total number N of the data table and the total number M of the database; wherein N is more than or equal to M, and each database can contain at most N/M data tables.
The step is a preparation work in advance for sub-database and sub-table storage, in order to effectively and evenly distribute N data tables in M databases and further to commonly exert the performance of each database, and in order to facilitate the expansion, the expansion and the maintenance of the algorithm, the value of M needs to be set to be the M power of 2, the value of N needs to be set to be the N power of 2, and N is larger than or equal to M, namely, M and N can be expressed by a plurality of powers of 2 to ensure that N/M can be divided.
The larger the N value and the M value are, the more the capability of a database single table can be fully utilized, but the maintenance cost is increased and the operation and maintenance difficulty is increased; on the contrary, the smaller the N value and the M value are, the lower the maintenance cost and the lower the operation and maintenance difficulty are, but the problem of single-meter performance bottleneck occurs. Therefore, the N value and the M value should be reasonably configured according to actual requirements during design, so that the performance index and the ease of maintenance reach an optimal balance. Generally, the total number N of the data tables and the total number M of the databases can be configured in an integrated tuning manner according to parameters such as concurrency, data volume, throughput and the like of enterprise services; for example, the larger the concurrency of enterprise traffic, the larger the data volume, and the larger the throughput, the larger the total number M of databases and the total number N of data tables. Of course, if the economic strength of the enterprise allows, the values of N and M may be set as large as possible, and are not particularly limited herein.
In one specific embodiment, the estimation of the N and M values is as follows:
estimation of the value of N: taking the case that the number of data lines in a single table should not exceed 2000 ten thousand, assuming that the data in a certain data table within a reasonable service time range does not exceed 12 hundred million lines, the value of N should be: n = 1200000000/20000000 = 60, and N =64 is finally estimated to satisfy the power N of 2.
Estimation of the M value: since N is larger than or equal to M and the value of M is the M power of 2, the value range of M is as follows: 2,4,8, 16, 32, 64. Assuming that the value M =2, two databases are required to accommodate N =64 data tables, i.e. each database will accommodate 32 data tables; assuming that the value M =16, 16 databases are needed to accommodate N =64 data tables, that is, each database will accommodate 4 data tables; assuming the value M =64, 64 databases would be required to hold N =64 data sheets, i.e. each database would hold 1 data sheet.
The embodiment of the present invention will be described with M =4 as an example, that is, 4 databases are used to accommodate 64 tables, i.e., M =4 and N =64, and each database will accommodate 16 tables.
Step 20, naming each database according to the library index, and increasing the library names of the M databases in sequence according to the library index; the data tables are named according to the table indexes, so that the table names of the N data tables are increased in the table index sequence.
When a plurality of tables are dispersed in different databases in the mainstream database-based table-dividing scheme in the market at present, the table names of the different databases are completely the same, for example, table _0, table _1 and table _2 are stored in database1, and then table _0, table _1 and table _2 are also stored in database2, and the database names must be taken to distinguish from each other during maintenance. Therefore, the difficulty of accessing the database by a business team is increased after the database is divided into the tables, and in order to reduce the complexity in the aspect, all N data tables are named according to the ascending sequence of the table indexes, so that the data tables can be completely distinguished only through the table names.
In a specific embodiment, the naming method of the database specifically includes: the M databases are named in a mode of library name plus _; since the bin indices are incremented one by one starting from 0, when M =4, the names of the 4 databases are: database _0, database _1, database _2, and database _ 3. Similarly, the naming method of the data table specifically comprises the following steps: the N data tables are named in a mode of ' table name ' + ' _ ' and ' table index ' + '; since the table indexes are incremented one by one from 0, when N =64, the names of 64 data tables are: table _0, table _1, table _2, …, table _ 63. Reference may be made specifically to fig. 2:
database _0 database contains data tables table _0, table _1, …, table _ 15;
database _1 database contains data tables table _16, table _17, …, table _ 31;
database _2 database contains data tables table _32, table _33, …, table _ 47;
the database _3 database contains data tables table _48, table _49, …, table _ 63.
And step 30, selecting the splitting key field from the data table, determining a corresponding library index and a corresponding table index for each data based on the value of the splitting key field, and further storing the data into the corresponding database and the corresponding data table.
In order to distribute massive data in different databases and data tables according to a certain rule or algorithm, the invention selects a fixed field in the data table and names the field as a split key field; such as an "age" field, a "name" field, etc. For each data in the mass data to be stored, based on the value of the split key field in the data, it is calculated in which database and which data table the record row (i.e. the data) where the split key field is located should fall, and then data storage is completed according to the calculation result. Referring to fig. 3, for each data to be stored, the specific sub-library and sub-table storage process includes the following steps:
step 301, determining a value X of the split key field in the data, and performing hash calculation on the value X of the split key field by using a hash algorithm to obtain a corresponding hash value H.
When the calculation of the sub-base and the sub-table is carried out, in order to avoid the phenomenon that the use of certain machine resources is tense and some data tables are quite idle due to the fact that some data tables have too much data and some data tables have too little data caused by the problem of data inclination, the hash algorithm is adopted to carry out hash calculation on the values of the split key fields, and the more the hash calculation is dispersed, the more the inclination of the data can be reduced; the hash algorithm may employ the murmurmur 3 algorithm or the murmur2 algorithm. Among them, the murmur3 algorithm is faster than the murmur2 algorithm, especially has higher balance and low collision rate to the massive data, so the murmur3 algorithm can be adopted preferentially. In addition, the murmur3 algorithm implements 32-bit and 128-bit HashKey versions, respectively, and since the 128-bit algorithm is optimized for the respective platforms, the x86 and x64 versions will produce different values when 128 bits are used. In the scheme, a 128-bit murmur3 algorithm is taken as an example, and values X of split key fields are taken as character strings "ABC", "123" and "ABC 123", respectively, and the hash calculation result is specifically as follows:
when the value X of the split key field is the character string 'ABC', the murmurmur 3 hash calculation is carried out on the character string 'ABC', and the obtained hash value is as follows: h = 8233032603788628002;
when the value X of the split key field is the character string "123", after the murmurmur 3 hash calculation is performed on the character string "123", the obtained hash value is: h = 7468325962851647638;
when the value X of the split key field is the character string 'ABC 123', after the character string 'ABC 123' is subjected to murmurmur 3 hash calculation, the obtained hash value is: h = 8221188151294809562.
Step 302, calculating a library index by a remainder calculation method based on the hash value H and the total number M of the database; and calculating the table index by a remainder calculation method based on the hash value H and the total number N of the data table.
Taking M =4 and N =64 as an example, the value of the split key field is X, the number of data tables of each database is P = N/M =16, and the hash value obtained by performing the murmurmur 3 algorithm on the split key value X is H = murmurmur 3 — 128 (X). In conjunction with fig. 2, it can be determined by the following database-splitting algorithm and table-splitting algorithm which record row with the value X of the split key field should fall into which data table of which database.
A database partitioning algorithm: bin index = (H/P)% M, i.e., [ murmurmur 3 — 128(X)/16 ]% 4;
and (3) a table division algorithm: table index = (H% N), i.e., murmurmur 3 — 128 (X)% 64;
where,% represents the remainder. Still taking the values X of the split key field as the character strings "ABC", "123", and "ABC 123", respectively, as an example, the results of the sub-library calculation and the sub-table calculation are shown in table 1:
table 1:
Figure 428886DEST_PATH_IMAGE001
step 303, determining the database and the data table where the data is located according to the calculated library index and table index, and further storing the data into the corresponding data table of the corresponding database.
After the database index and the table index corresponding to each data are obtained through the database dividing algorithm and the table dividing algorithm, the database and the data table which are stored by each data are determined. For example, from the calculation results in table 1, it can be seen that:
if the value X of the split key field is the string "ABC", then the record will fall into the data table _34 in database _ 2;
if the value of the split key field is string "123", then the record will fall in data table _22 in database _ 1;
if the value of the split key field is the string "ABC 123", then the record will fall in the data table _26 in database _ 1.
And the calculation results completely conform to the distribution of the database index and the data table index.
In order to enable the service code to more conveniently access the data sources after the sub-libraries and the sub-tables, the scheme implements and encapsulates the rules and algorithms in the step 30 on the codes based on the source code of the open source ShardingSphere, as shown in fig. 2; the service code only needs to access the database without index and the data table without index to access the desired data because the calculation of the library index and the table index is performed and routed completely according to the above rules and algorithms. Wherein, when using open source ShardingSphere, the algorithm of murmurur 3 will be implemented using open source of Google corporation, and the core code is shown in fig. 4; core codes of the banking algorithm are shown in fig. 5 and 6, and code implementation is performed around (H/P)% M; core code of the table division algorithm is shown in fig. 7 and 8, and code implementation is performed around H/% N.
It should be noted that, although the embodiment of the present invention uses the open source ShardingSphere framework for code implementation and encapsulation, the present invention is not limited thereto. In alternative embodiments, the code encapsulation may also be implemented by other similar middleware frameworks, such as mycat, and even self-developed frameworks, which are not specifically limited herein.
Further, in a preferred embodiment, in order to ensure the uniqueness and query performance of the data after being divided into the base and the table, a corresponding unique index, an aggregated index, a non-aggregated index and the like can be established for each data table according to the business rules, and a SnowFlake algorithm (SnowFlake) of Twitter company is used as a main key of the data tables.
In summary, the database-based and table-based algorithm autonomously invented by the embodiment of the present invention has the advantages of easy use, easy test, easy maintenance, high performance, and the like, and specifically the following steps:
the use is easy: after a certain table is divided into a plurality of data tables in the traditional scheme, the difficulty of accessing data by service personnel and maintenance personnel is increased, and when a developer operates the table in the scheme, the developer does not need to know which data table should be specifically operated, a database-dividing and table-dividing algorithm can automatically calculate and automatically position which specific database and which data table, the process is completely transparent and insensitive to the developer, and the use threshold can be effectively reduced.
Easy test: the database partitioning algorithm and the table partitioning algorithm provided by the scheme only need three input parameters (namely the number M of the databases, the number N of the data tables and the value X of the splitting key field), then the specific databases and the specific data tables can be rapidly calculated through a delicate algorithm, and testers can conveniently and easily test the algorithm by means of a common calculator, Excel and other tools.
Easy maintenance: when a plurality of tables are dispersed in different databases by a mainstream database-based and table-based scheme in the market at present, the table names of the different databases are completely the same, and the database names are required to be taken to be distinguished during maintenance.
High performance: through test code testing, 2 hundred million times of algorithm calculation is carried out on values of different split key fields, only 1.5 seconds are consumed, and the whole operation process is more efficient.
The invention is based on an open source framework and a database-dividing and table-dividing algorithm independently invented, and carries out efficient vertical division and horizontal division on a traditional relational database, particularly, N data tables are dispersed in M databases, fixed division key fields are selected for the data tables, and each datum is positioned to a specific database and the data table for storage based on the value of the division key fields, so that the mass data is stored in the relational database in a structured mode, the storage problem of the mass data is effectively solved, the maintenance cost of an enterprise on the mass data is reduced on the whole, the processing difficulty of service data is reduced, the utilization rate of database hardware is greatly improved, and the storage requirements of various businesses of the enterprise on the mass data are met.
Example 2
When the traditional scheme is used for carrying out batch query on data through a query statement, namely when the query with the IN keywords is executed, all value lists corresponding to the IN keywords are sent to different databases for execution. However, it is the case that some values in these value lists are only stored in the a 'table of the a database and should be searched in the a' table of the a database, and some values are only stored in the B 'table of the B database and should be searched in the B' table of the B database. However, when such batch data query is performed, the conventional scheme does not perform additional clustering processing, but simply and roughly distributes the original SQL query statement to each database for execution, which wastes resources and reduces execution efficiency.
To solve the above problem, on the basis of the database-based and table-based data storage in embodiment 1, an embodiment of the present invention further provides a method for batch query of data, as shown in fig. 9, when batch query of data is required, the query is specifically completed according to the following steps:
step 401, clustering the values of the keywords in the query statement so as to summarize the values with the same database and data table coordinates together, and finally obtaining z groups.
Wherein, the value of the keyword in the query statement specifically adopts a hash value H calculated by correspondingly splitting the key value X, and then the clustering process specifically comprises: and determining a library index and a table index corresponding to each hash value H in the query statement, and summarizing the hash values with the same library index and table index together to form a group. This is because when the bin index and the table index calculated corresponding to the hash value H are the same, it indicates that the data corresponding to the hash values are stored in the same data table of the same database, i.e. have the same database and data table coordinates, and thus can be divided into one group. That is, a packet contains values of one or more keys, and the values are located in the same data table corresponding to the represented data.
Step 402, splitting the query statement into z segmented query statements according to the clustering result, wherein each segmented query statement corresponds to a data table in a database.
Here, splitting the query statement mainly splits the value list of the keywords, that is, the values of the keywords divided into the same group in step 401 are split into the same segmented query statement, so as to obtain z segmented query statements.
And 403, respectively issuing the z segmented query sentences to a data table in a corresponding database for execution. Therefore, the distribution and execution of the targeted query statement are finished, so that more things can be done by using less database resources, and the overall execution efficiency is improved.
The following describes the above query process by using a specific embodiment:
taking the MySQL database as an example, assuming that the SQL query statement of the current service code is as shown in fig. 10, the logic _ table represents a logic table, i.e., a data table in the embodiment of the present invention; the parenthesis following the IN key represents a list of values, namely the hash value H. Taking the open source ShardingSphere framework as an example, when M =4 and N =64, and if the method of splitting the query statement is not used, the ShardingSphere distributes the query statement to the database _0 library, the database _1 library, the database _2 library and the database _3 library for execution, and the executed statement is the statement shown in fig. 10. However, by way of example of the foregoing algorithm:
splitting a key value X: 8233032603788628002 (corresponding to hash value H) under table _34 in database _ 2;
splitting a key value X: 7468325962851647638 (corresponding to hash value H) under table _22 in database _ 1;
splitting a key value X: 8221188151294809562 (corresponding to hash value H) is recorded under table _26 in database _ 1.
Then, if the distribution and execution of the SQL query statement are performed according to the conventional method, obviously some values do not exist IN a specific database and data table, the query range is expanded, the waste of performance increases as the number of IN keyword post-value lists increases, and the efficiency tends to decrease.
By the method provided by the embodiment of the invention, the SQL query statement can be divided into a plurality of different items, and the specific method is to cluster the value list after the IN keyword, so that the SQL query statement can be accurately distributed and executed, the query speed is accelerated, and the overall performance is improved. Taking the SQL query statement shown in fig. 10 as an example, according to the difference between the library index and the table index corresponding to the hash value H in the value list of the keyword, the SQL query statement is split into three segmented SQL query statements as shown in fig. 11.
Wherein, the first segmented SQL query statement is only sent to table _34 in database _2 for execution; the second segmented SQL query statement is only sent to table _22 in database _1 for execution; the third segmented SQL query statement is sent to only the table _26 in the data database _1 for execution. Thus, the overall execution efficiency is greatly improved compared with that of the native Shardingsphere.
IN summary, by the method provided by the embodiment of the present invention, when performing batch query of data, the value list of the IN keyword can be analyzed and summarized accurately, and clustering is performed according to different data sources, so that the original query statement is divided into different pieces, and different values are located IN the data source and the data table where they should be located accurately, thereby greatly reducing the query range and speeding up the query performance.
Example 3:
on the basis of the method for storing the database and the table of the mass data provided in the foregoing embodiment 1 and embodiment 2, the present invention further provides a device for storing the database and the table of the mass data, which is capable of implementing the method described above, and as shown in fig. 12, the device is schematically configured in the embodiment of the present invention. The device for storing the sub-database and sub-table of the mass data of the embodiment includes one or more processors 21 and a memory 22. In fig. 12, one processor 21 is taken as an example.
The processor 21 and the memory 22 may be connected by a bus or other means, and fig. 12 illustrates the connection by a bus as an example.
The memory 22 is used as a non-volatile computer-readable storage medium for storing a sub-database and sub-table storage method of mass data, and can be used for storing non-volatile software programs, non-volatile computer-executable programs, and modules, such as the sub-database and sub-table storage method of mass data in embodiment 1. The processor 21 executes various functional applications and data processing of the device for storing the mass data in the database and table, namely, implements the method for storing the mass data in the database and table according to embodiments 1 and 2, by operating the nonvolatile software program, the instructions and the modules stored in the memory 22.
The memory 22 may include high speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 22 may optionally include memory located remotely from the processor 21, and these remote memories may be connected to the processor 21 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The program instructions/modules are stored in the memory 22, and when executed by the one or more processors 21, perform the method for storing the mass data in the sub-library and sub-table in embodiment 1, for example, perform the steps shown in fig. 1, fig. 3, and fig. 9 described above.
Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods of the embodiments may be implemented by associated hardware as instructed by a program, which may be stored on a computer-readable storage medium, which may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (8)

1. A method for storing massive data in a database and a table is characterized by comprising the following steps:
determining the total number N of data tables corresponding to all databases and the total number M of the databases; n is more than or equal to M, and each database contains N/M data tables;
naming each database according to the database index, so that the database names of the M databases are increased progressively according to the database index sequence; naming each data table according to the table index, so that the table names of the N data tables are increased progressively according to the table index sequence;
selecting a splitting key field from a data table, determining a corresponding library index and a table index for each data based on the value of the splitting key field, and further storing the data into a corresponding database and a corresponding data table; the method specifically comprises the following steps:
determining a value X of a split key field in the data, and performing hash calculation on the value X of the split key field by adopting a hash algorithm to obtain a corresponding hash value H; calculating a database index by a remainder calculation method based on the hash value H and the total number M of the database; calculating the table index by a remainder calculation method based on the hash value H and the total number N of the data table; determining a database and a data table where the data are located according to the calculated library index and table index, and further storing the data into a corresponding data table of a corresponding database;
the database indexes of the M databases are gradually increased from 0, and the table indexes of the N data tables are gradually increased from 0;
calculating a database index by a remainder calculation method based on the hash value H and the total number M of the database, specifically: bin index = (H/P)% M; wherein, P = N/M,% represents remainder;
the table index is calculated by a remainder calculation method based on the hash value H and the total number N of the data table, and specifically comprises the following steps: table index = (H% N).
2. The method for sub-warehouse and sub-table storage of mass data according to claim 1, wherein the hash algorithm is murmurur 3 algorithm or murmurur 2 algorithm.
3. The method for sub-base and sub-table storage of mass data according to claim 1, wherein when batch query of data is required, the method further comprises:
clustering values of keywords in the query statement so as to summarize values with the same database and data table coordinates together and finally obtain z groups;
splitting the query statement into z segmented query statements according to the clustering result, wherein each segmented query statement corresponds to one data table in one database;
and respectively issuing the z segmented query sentences to a data table in a corresponding database for execution.
4. The method according to claim 3, wherein the hash value H is used as the value of the keyword in the query statement, and the clustering process performed on the value of the keyword in the query statement specifically comprises:
and determining a library index and a table index corresponding to each hash value in the query statement, and summarizing the hash values with the same library index and table index together to form a group.
5. The method for storing the mass data according to any one of claims 1 to 4, wherein the value of M is the M power of 2, the value of N is the N power of 2, and N is greater than or equal to M.
6. The method for storing the databases according to any of claims 1 to 4, wherein the total number N of the data tables and the total number M of the databases are configured according to the concurrency, data volume, and throughput of the enterprise services.
7. The method for sub-base and sub-table storage of mass data according to any one of claims 1 to 4, wherein the naming method of the database is specifically as follows: the M databases are named in a mode of library name plus _; wherein, the library index is gradually increased from 0;
the naming method of the data table specifically comprises the following steps: the N data tables are named in a mode of ' table name ' + ' _ ' and ' table index ' + '; where the table index is incremented one by one starting from 0.
8. A device for storing mass data in different banks and tables, comprising at least one processor and a memory, wherein the at least one processor and the memory are connected through a data bus, and the memory stores instructions executable by the at least one processor, and the instructions are used for completing the method for storing mass data in different banks and tables according to any one of claims 1 to 7 after being executed by the processor.
CN202011118667.3A 2020-10-19 2020-10-19 Method and device for storing massive data in different databases and tables Pending CN112015741A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011118667.3A CN112015741A (en) 2020-10-19 2020-10-19 Method and device for storing massive data in different databases and tables

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011118667.3A CN112015741A (en) 2020-10-19 2020-10-19 Method and device for storing massive data in different databases and tables

Publications (1)

Publication Number Publication Date
CN112015741A true CN112015741A (en) 2020-12-01

Family

ID=73528031

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011118667.3A Pending CN112015741A (en) 2020-10-19 2020-10-19 Method and device for storing massive data in different databases and tables

Country Status (1)

Country Link
CN (1) CN112015741A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112380276A (en) * 2021-01-15 2021-02-19 四川新网银行股份有限公司 Method for querying data by non-fragment key fields after database division and table division of distributed system
CN112612777A (en) * 2020-12-24 2021-04-06 浙江大学 MySQL database-based marine data management and visualization system and method
CN112632066A (en) * 2020-12-18 2021-04-09 平安普惠企业管理有限公司 Automatic sub-table data access method and device, electronic equipment and storage medium
CN113032439A (en) * 2021-05-20 2021-06-25 广州宸祺出行科技有限公司 Method, device and system for querying database with database-to-table structure
CN113094262A (en) * 2021-03-29 2021-07-09 四川新网银行股份有限公司 Method for testing production data based on database sub-base sub-table
CN113238993A (en) * 2021-05-14 2021-08-10 中国人民银行数字货币研究所 Data processing method and device
CN114661687A (en) * 2022-03-06 2022-06-24 中电万维信息技术有限责任公司 Distributed file storage method based on Shardingsphere and Fastdfs

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112632066A (en) * 2020-12-18 2021-04-09 平安普惠企业管理有限公司 Automatic sub-table data access method and device, electronic equipment and storage medium
CN112632066B (en) * 2020-12-18 2023-08-25 湖北华中电力科技开发有限责任公司 Automatic sub-table data access method and device, electronic equipment and storage medium
CN112612777A (en) * 2020-12-24 2021-04-06 浙江大学 MySQL database-based marine data management and visualization system and method
CN112612777B (en) * 2020-12-24 2023-12-12 浙江大学 Ocean data management and visualization system and method based on MySQL database
CN112380276A (en) * 2021-01-15 2021-02-19 四川新网银行股份有限公司 Method for querying data by non-fragment key fields after database division and table division of distributed system
CN113094262A (en) * 2021-03-29 2021-07-09 四川新网银行股份有限公司 Method for testing production data based on database sub-base sub-table
CN113094262B (en) * 2021-03-29 2022-10-18 四川新网银行股份有限公司 Method for testing production data based on database and table division
CN113238993A (en) * 2021-05-14 2021-08-10 中国人民银行数字货币研究所 Data processing method and device
CN113238993B (en) * 2021-05-14 2023-12-05 中国人民银行数字货币研究所 Data processing method and device
CN113032439A (en) * 2021-05-20 2021-06-25 广州宸祺出行科技有限公司 Method, device and system for querying database with database-to-table structure
CN114661687A (en) * 2022-03-06 2022-06-24 中电万维信息技术有限责任公司 Distributed file storage method based on Shardingsphere and Fastdfs
CN114661687B (en) * 2022-03-06 2024-04-09 中电万维信息技术有限责任公司 Distributed file storage method based on SharingSphere and Fastdfs

Similar Documents

Publication Publication Date Title
CN112015741A (en) Method and device for storing massive data in different databases and tables
US20220405284A1 (en) Geo-scale analytics with bandwidth and regulatory constraints
CN106897322B (en) A kind of access method and device of database and file system
CN107943952B (en) Method for realizing full-text retrieval based on Spark framework
US6801903B2 (en) Collecting statistics in a database system
US20170083573A1 (en) Multi-query optimization
EP3014488B1 (en) Incremental maintenance of range-partitioned statistics for query optimization
US8396852B2 (en) Evaluating execution plan changes after a wakeup threshold time
US9471710B2 (en) On-the-fly encoding method for efficient grouping and aggregation
US7676453B2 (en) Partial query caching
US10565201B2 (en) Query processing management in a database management system
US11556534B2 (en) Subquery predicate generation to reduce processing in a multi-table join
US8935233B2 (en) Approximate index in relational databases
US20180276264A1 (en) Index establishment method and device
US11645281B1 (en) Caching query plans in database systems
US9229969B2 (en) Management of searches in a database system
KR101955376B1 (en) Processing method for a relational query in distributed stream processing engine based on shared-nothing architecture, recording medium and device for performing the method
Arnold et al. HRDBMS: Combining the best of modern and traditional relational databases
CN112818010B (en) Database query method and device
CN113486023A (en) Database and table dividing method and device
Wu et al. PABIRS: A data access middleware for distributed file systems
JPH10269225A (en) Data base dividing method
US11586604B2 (en) In-memory data structure for data access
Zhu et al. Optimization of generic progressive queries based on dependency analysis and materialized views
Fritchey Index Architecture

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination