CN111259012B - Data homogenizing method, device, computer equipment and storage medium - Google Patents


Info

Publication number
CN111259012B
Authority
CN
China
Prior art keywords
data
region
key
value
rowkey
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010066910.5A
Other languages
Chinese (zh)
Other versions
CN111259012A (en)
Inventor
郑金伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN202010066910.5A
Publication of CN111259012A
Application granted
Publication of CN111259012B
Legal status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22: Indexing; Data structures therefor; Storage structures
    • G06F16/2282: Tablespace storage structures; Management thereof
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2458: Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474: Sequence data queries, e.g. querying versioned data
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data homogenizing method, a data homogenizing device, computer equipment and a storage medium. The method comprises the following steps: acquiring all primary key contents of a data table; determining the number of Regions corresponding to each primary key content according to the maximum data storage amount of a Region and the total data amount corresponding to each primary key content; acquiring a RowKey data set corresponding to the primary key content; after extracting a first key value from each RowKey in the RowKey data set, sorting each primary key data corresponding to the primary key content according to the first key value to obtain a sequence value; determining the boundary range of each Region corresponding to the primary key content from all primary key data according to the number of Regions corresponding to the primary key content and the sequence values; and acquiring the first key value of the primary key data in the primary key content, and distributing the primary key data corresponding to the first key value to the Region for storage when the first key value is confirmed to be within the boundary range of the Region. The invention can distribute the data in the data table evenly across the Regions.

Description

Data homogenizing method, device, computer equipment and storage medium
Technical Field
The present invention relates to the field of data storage, and in particular, to a data homogenizing method, apparatus, computer device, and storage medium.
Background
A Region is the minimum unit of distribution in an HBase cluster, and HBase distributes the Regions owned by each business table across the Region servers. In general, a service data table has only one corresponding Region at the beginning. As the data volume of the service data table grows and reaches a threshold, the Region is split, repeatedly, into a plurality of Regions. When there are many service data tables, their data becomes unevenly distributed across the Regions, which puts pressure on the Region servers associated with those Regions and degrades the read-write performance of HBase. Frequent Region splitting also consumes a large share of the resources of the servers on which the HBase cluster runs, which further degrades the read-write performance of HBase. Therefore, a solution to the above problems is needed.
Disclosure of Invention
Accordingly, in order to solve the above-mentioned problems, it is necessary to provide a data homogenizing method, apparatus, computer device and storage medium for homogenizing data in a data table in each Region, so as to improve the read-write performance of the Region server corresponding to the Region.
A data homogenizing method, comprising:
setting the maximum data storage amount of a Region in the HBase cluster as a preset data threshold;
receiving a data homogenizing instruction for a data table in the HBase cluster, and acquiring all primary key contents of the data table; wherein each primary key content is associated with the total data amount of one class of storage data in the data table; the storage data corresponding to a primary key content comprises at least one primary key data;
determining the number of Regions corresponding to each primary key content in the data table according to the preset data threshold and the total data amount corresponding to each primary key content;
acquiring a RowKey data set corresponding to each primary key content in the data table by using a preset Hash algorithm; the RowKey data set comprises the RowKey of each primary key data corresponding to the primary key content;
after extracting a first key value from each RowKey in the RowKey data set corresponding to the primary key content, sorting each primary key data corresponding to the primary key content according to the first key value, and obtaining a sequence value after sorting;
determining two boundary elements of each Region corresponding to the primary key content from all primary key data corresponding to the primary key content according to the number of Regions corresponding to the primary key content and the sequence values, and determining a front boundary value and a rear boundary value of each Region through the two boundary elements of that Region; the front boundary value and the rear boundary value are both first key values corresponding to the boundary elements; the range between the front boundary value and the rear boundary value of each Region is recorded as the boundary range of the Region;
and acquiring the first key value of the primary key data corresponding to the primary key content, and distributing the primary key data corresponding to the first key value to a Region corresponding to the primary key content for storage when the acquired first key value is confirmed to be within the boundary range of that Region.
A data homogenizing device, comprising:
the setting module is used for setting the maximum data storage amount of a Region in the HBase cluster as a preset data threshold;
the first acquisition module is used for receiving a data homogenizing instruction for a data table in the HBase cluster and acquiring all primary key contents of the data table; wherein each primary key content is associated with the total data amount of one class of storage data in the data table; the storage data corresponding to a primary key content comprises at least one primary key data;
the first determining module is used for determining the number of Regions corresponding to each primary key content in the data table according to the preset data threshold and the total data amount corresponding to each primary key content;
the second acquisition module is used for acquiring a RowKey data set corresponding to each primary key content in the data table by using a preset Hash algorithm; the RowKey data set comprises the RowKey of each primary key data corresponding to the primary key content;
the third acquisition module is used for extracting a first key value from each RowKey in the RowKey data set corresponding to the primary key content, sorting each primary key data corresponding to the primary key content according to the first key value, and acquiring a sequence value after sorting;
the second determining module is used for determining two boundary elements of each Region corresponding to the primary key content from all primary key data corresponding to the primary key content according to the number of Regions corresponding to the primary key content and the sequence values, and determining a front boundary value and a rear boundary value of each Region through the two boundary elements of that Region; the front boundary value and the rear boundary value are both first key values corresponding to the boundary elements; the range between the front boundary value and the rear boundary value of each Region is recorded as the boundary range of the Region;
and the distribution module is used for acquiring the first key value of the primary key data corresponding to the primary key content, and distributing the primary key data corresponding to the first key value to a Region corresponding to the primary key content for storage when the acquired first key value is confirmed to be within the boundary range of that Region.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the above data homogenization method when executing the computer program.
A computer readable storage medium storing a computer program which, when executed by a processor, implements the data homogenization method described above.
According to the data homogenizing method, device, computer equipment and storage medium, the number of Regions is bounded by the preset data threshold and the total data amount of each primary key content in the data table. This prevents frequent, unbounded splitting of Regions, which would otherwise occupy a large amount of HBase cluster resources and degrade the overall read-write performance of the cluster, so the stability of the HBase cluster is also improved. The primary key data corresponding to each first key value in the storage data of a primary key content is distributed to the Regions for storage, so that the primary key data of each primary key content in the data table is spread approximately evenly across the Regions. This reduces the load on the Region servers corresponding to the Regions, increases their running speed, and improves their read-write performance.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an application environment of a data homogenizing method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a data homogenizing method in accordance with an embodiment of the present invention;
FIG. 3 is a flow chart of the data homogenizing method step S40 in an application environment according to an embodiment of the present invention;
FIG. 4 is a flowchart of a data homogenizing method in an application environment after step S70 in accordance with an embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating a data homogenizing apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a computer device in accordance with an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The data homogenizing method provided by the invention can be applied to an application environment as shown in fig. 1, wherein a client communicates with a server through a network. The clients may be, but are not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.
In one embodiment, as shown in fig. 2, a data homogenizing method is provided, which is illustrated by using the server in fig. 1 as an example, and includes the following steps:
S10, setting the maximum data storage amount of a Region in the HBase cluster as a preset data threshold;
As can be appreciated, HBase (Hadoop Database) is a highly reliable, high-performance, column-oriented, scalable distributed storage system, and a large-scale structured storage cluster (an HBase cluster) can be built on servers by using HBase technology. The preset data threshold represents the data storage amount of one Region (a data partition) and can be set as required; for example, the preset data threshold may be 256 MB. The server may also automatically set the maximum data storage amount of the Region as the preset data threshold.
S20, receiving a data homogenizing instruction for a data table in the HBase cluster, and acquiring all primary key contents of the data table; wherein each primary key content is associated with the total data amount of one class of storage data in the data table; the storage data corresponding to a primary key content comprises at least one primary key data;
It is understood that the data table includes a plurality of primary key contents, and each primary key content is associated with the total data amount of one class of storage data (the total data amount that the primary key content should occupy has been preset in the data table). For example, the primary key content may be a name, a telephone number, an identity card number, and so on, in which case the primary key data in the associated class of storage data is a specific name, a specific telephone number, or a specific identity card number, and the data table specifies how much data capacity (total data amount) those specific names, telephone numbers, and identity card numbers should occupy in the data table.
S30, determining the number of regions corresponding to each primary key content in the data table according to the preset data threshold and the total data amount corresponding to each primary key content;
It can be appreciated that the total data amount of a primary key content is used to determine the number of Regions corresponding to that primary key content. For example, if the total data amount of a primary key content is 2560 MB and the preset data threshold is 256 MB, the number of Regions is determined to be 10; if the total data amount is 500 MB and the preset data threshold is 256 MB, the number of Regions is determined to be 2 (when the total data amount is not exactly divisible by the preset data threshold, the result is rounded to an integer number of Regions). One primary key content corresponds to at least one Region; the greater the total data amount corresponding to the primary key content, the greater the number of corresponding Regions.
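The Region-count arithmetic in step S30 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `region_count` is hypothetical, and rounding up (so that a partial remainder still gets its own Region) is an assumption that agrees with both worked examples above.

```python
def region_count(total_mb: int, threshold_mb: int = 256) -> int:
    """Number of Regions for one primary key content (illustrative sketch)."""
    if threshold_mb <= 0:
        raise ValueError("threshold_mb must be positive")
    # Assumption: round up, so leftover data still gets a Region,
    # and every primary key content has at least one Region.
    return max(1, (total_mb + threshold_mb - 1) // threshold_mb)


print(region_count(2560))  # 10
print(region_count(500))   # 2
```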
S40, acquiring a RowKey data set corresponding to each primary key content in the data table by using a preset Hash algorithm; the RowKey data set comprises the RowKey of each primary key data corresponding to the primary key content;
Understandably, hashing converts an input of arbitrary length (also called a pre-image) into an output of fixed length by means of a hashing algorithm (for the details of the preset Hash algorithm, see steps S401 to S403). A RowKey can be calculated for each primary key data of each primary key content through the preset Hash algorithm, and the RowKeys calculated for a plurality of primary key data form the RowKey data set.
S50, after extracting a first key value from each RowKey in the RowKey data set corresponding to the primary key content, sorting each primary key data corresponding to the primary key content according to the first key value, and obtaining a sequence value after sorting;
It can be understood that a RowKey is obtained through the preset Hash algorithm for each primary key data in the class of storage data corresponding to each primary key content, and each RowKey is associated with a first key value. When the first key value contains both digits and letters, it can be sorted in dictionary order by its letters only, or by its digits only; when the first key value contains only digits, it has a numerical value, so the primary key data associated with each RowKey can be sorted by the numerical value of the first key value. The sequence value is the mark (for example 1, 2, 3, and so on) assigned to each first key value once the sorting is complete.
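The sorting in step S50 can be illustrated with a short sketch. The text does not define exactly which part of the RowKey is the first key value; the sketch assumes it is the prefix before the splicing symbol `_` (as in the RowKey format produced in steps S401 to S403) and sorts it in dictionary order. The names `first_key` and `sequence_values` and the sample RowKeys are hypothetical.

```python
def first_key(rowkey: str, symbol: str = "_") -> str:
    """Assumed first key value: the RowKey prefix before the splicing symbol."""
    return rowkey.split(symbol, 1)[0]


def sequence_values(rowkeys):
    """Map each RowKey to its 1-based sequence value after sorting by first key."""
    ordered = sorted(rowkeys, key=first_key)
    return {rk: seq for seq, rk in enumerate(ordered, start=1)}


seq = sequence_values(["b3_13800000000", "a8_13700000000", "a1_13900000000"])
for rk, s in sorted(seq.items(), key=lambda item: item[1]):
    print(s, rk)
```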
S60, determining two boundary elements of each Region corresponding to the primary key content from all primary key data corresponding to the primary key content according to the number of Regions corresponding to the primary key content and the sequence values, and determining a front boundary value and a rear boundary value of each Region through the two boundary elements of that Region; the front boundary value and the rear boundary value are both first key values corresponding to the boundary elements; the range between the front boundary value and the rear boundary value of each Region is recorded as the boundary range of the Region;
It will be appreciated that each Region has a boundary range of the same size and therefore two boundaries (a front boundary and a rear boundary), although the front and rear boundary values differ from one Region to another. A boundary element is the primary key data corresponding to a first key value. Because the primary key data of each primary key content has been sorted by the first key value of its RowKey, the two boundary elements of each Region can be read off from the sequence values, and the first key values of those boundary elements are the front boundary value and the rear boundary value of that Region. For example, suppose one primary key content contains 100 primary key data in total, the first key values of the corresponding 100 RowKeys have been sorted to give sequence values, and the number of Regions has been determined to be 10. Following the principle that data is distributed evenly across Regions, each Region receives 10 primary key data; within each group of 10, the primary key data with the smallest sequence value and the primary key data with the largest sequence value are the two boundary elements of that Region, and their first key values are its front and rear boundary values. It should be noted that, because the number of Regions is determined by the preset data threshold and the total data amount corresponding to the primary key content, when the total data amount is not exactly divisible by the preset data threshold the last Region stores less data than each of the preceding Regions. The difference is small, however, so the load balance of the HBase cluster is not affected.
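The boundary determination of step S60 can be sketched as splitting the sorted first key values into contiguous, equally sized groups and taking the first and last element of each group. The function name `region_boundaries` is hypothetical, and the use of ceiling division for the group size (so that only the last group can be smaller) is an assumption consistent with the remark about the last Region.

```python
def region_boundaries(sorted_keys, n_regions):
    """Return (front, rear) boundary values per Region from sorted first key values."""
    if n_regions <= 0 or not sorted_keys:
        raise ValueError("need at least one Region and one key")
    # Assumed group size: ceiling of len / n_regions, so only the last group may be short.
    # With very few keys this can yield fewer groups than n_regions.
    size = -(-len(sorted_keys) // n_regions)
    bounds = []
    for start in range(0, len(sorted_keys), size):
        chunk = sorted_keys[start:start + size]
        bounds.append((chunk[0], chunk[-1]))  # smallest and largest sequence position
    return bounds


bounds = region_boundaries(list(range(1, 101)), 10)
print(bounds[0], bounds[-1])  # (1, 10) (91, 100)
```

With 95 keys and 10 Regions the same function yields nine full groups and a last group of five, which mirrors the case where the last Region holds less data than the others.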
S70, acquiring the first key value of the primary key data corresponding to the primary key content, and distributing the primary key data corresponding to the first key value to a Region corresponding to the primary key content for storage when the acquired first key value is confirmed to be within the boundary range of that Region.
Understandably, each primary key content corresponds to a RowKey data set, and the RowKey data set is composed of RowKeys, each of which is associated with a first key value. The boundary range determined for a Region consists of a front boundary value and a rear boundary value, both of which are key values of the same kind as the first key value, so the first key value can be compared with them. When the first key value is larger than the front boundary value and smaller than the rear boundary value, the primary key data corresponding to the first key value is distributed to the Region corresponding to that boundary range for storage.
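The comparison in step S70 can be sketched as a range lookup. The function name `assign_region` is hypothetical. The sketch uses inclusive comparisons, which is an assumption: with strictly "larger than" and "smaller than" comparisons the boundary elements themselves would match no Region.

```python
def assign_region(key, bounds):
    """Return the index of the Region whose boundary range contains key, else None."""
    for idx, (front, rear) in enumerate(bounds):
        # Assumption: both boundaries are inclusive so boundary elements are placed too.
        if front <= key <= rear:
            return idx
    return None


bounds = [(1, 10), (11, 20), (21, 30)]
print(assign_region(15, bounds))  # 1
print(assign_region(42, bounds))  # None
```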
In the embodiment of steps S10 to S70, the number of Regions is bounded by the preset data threshold and the total data amount of each primary key content in the data table. This prevents frequent, unbounded splitting of Regions, which would otherwise occupy a large amount of HBase cluster resources and degrade the overall read-write performance of the cluster, so the stability of the HBase cluster is also improved. The primary key data corresponding to each first key value in the storage data of a primary key content is distributed to the Regions for storage, so that the primary key data of each primary key content in the data table is spread approximately evenly across the Regions. This reduces the load on the Region server corresponding to each Region (the Region server is the most important component of HBase; it performs the actual reading and writing of data and manages Regions) and increases its running speed, that is, it improves the read-write performance of the Region server.
Further, as shown in fig. 3, the obtaining, by using a preset Hash algorithm, a RowKey dataset corresponding to each of the primary key contents in the data table includes:
S401, applying Hash to each primary key data in the class of storage data corresponding to each primary key content in the data table to obtain the HashKey of each primary key data, and extracting the second key value corresponding to each HashKey;
S402, acquiring the character lengths used for the primary key data in each primary key content, and applying a preset absolute value algorithm to each second key value and each character length to obtain each third key value;
S403, acquiring the characters corresponding to each third key value from a preset character distribution table, and splicing the acquired characters with the corresponding primary key data through a preset symbol to obtain the RowKey data set; each RowKey in the RowKey data set is formed by splicing at least one character with one primary key data of one primary key content through the preset symbol.
It is understood that WORDS in the preset character distribution table is limited to 52 characters (the 26 English letters arranged twice, a to z followed by a to z again), and NUMBERS in the preset character distribution table is limited to the digits 0 to 9, so the resulting RowKeys are confined to a preset distribution range, which makes later querying and management of all RowKeys easier.
Specifically, the HashKey of each primary key data in each primary key content is first obtained through Hash; the Hash uses a corresponding hash function, which can be selected as required. For example, if one primary key data of the primary key content is the telephone number 13700000000, the Hash algorithm gives HashKey = -1088245419, and -1088245419 is the second key value. Next, the character lengths are obtained, namely the lengths of WORDS and NUMBERS; each second key value is divided by each character length and the absolute value of the remainder is taken (the preset absolute value algorithm). If the lengths of WORDS and NUMBERS are (52, 10), the HashKey above is divided by 52 and by 10 respectively and the absolute values of the remainders are taken, so the third key value is (27, 9). Finally, the third key value (that is, the special hash value calculated from WORDS and NUMBERS) is used to look up the corresponding characters in WORDS and NUMBERS of the preset character distribution table: for (27, 9), the 27th character is taken from WORDS and the 9th character from NUMBERS, and the extracted characters are spliced with the primary key data through the preset symbol to obtain the RowKey (the calculated RowKeys together form the RowKey data set). For example, assuming the 27th character is a, the 9th character is 8, and the primary key data is 13700000000, the RowKey obtained after splicing is a8_13700000000.
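The worked example in steps S401 to S403 can be reproduced with the sketch below. The text leaves the hash function open; a Java `String.hashCode`-style polynomial hash is used here as an assumption because it yields the HashKey -1088245419 for 13700000000. The table layout (`WORDS` as lowercase a to z twice), the 1-based character positions, and the wrap-around for a remainder of 0 are likewise assumptions chosen to match the example; the function names are hypothetical.

```python
WORDS = "abcdefghijklmnopqrstuvwxyz" * 2  # 52 characters (assumed layout)
NUMBERS = "0123456789"                    # 10 digits


def java_style_hash(s: str) -> int:
    """32-bit signed polynomial hash, h = 31*h + code, as in Java's String.hashCode (ASCII input)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def third_key(hash_key: int):
    """Absolute value of the remainder for each table length."""
    a = abs(hash_key)
    return a % len(WORDS), a % len(NUMBERS)


def make_rowkey(primary_key_data: str, symbol: str = "_") -> str:
    w, n = third_key(java_style_hash(primary_key_data))
    # Assumption: positions are 1-based as in the example; remainder 0 wraps to the last character.
    prefix = WORDS[(w - 1) % len(WORDS)] + NUMBERS[(n - 1) % len(NUMBERS)]
    return f"{prefix}{symbol}{primary_key_data}"


print(java_style_hash("13700000000"))  # -1088245419
print(third_key(-1088245419))          # (27, 9)
print(make_rowkey("13700000000"))      # a8_13700000000
```

Because the prefix depends only on the hash remainder, RowKeys for sequential telephone numbers are scattered across the prefix space rather than clustered, which is what allows the later sort-and-split step to produce evenly filled Regions.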
Further, as shown in fig. 4, after the step S70, the method further includes:
S701, acquiring a RowKey to be searched in the RowKey data set corresponding to the primary key content, and determining, according to the first key value corresponding to the RowKey, the target Region in the HBase cluster in which the primary key data corresponding to the RowKey is stored;
S702, locating at least one Region server that has a scheduling relationship with the determined target Region, establishing a communication connection with the Region server, reading the primary key data corresponding to the RowKey through the Region server, and writing the primary key data read by the Region server into a local preset data table through the communication connection.
In this embodiment, once the target Region has been found from the RowKey and a communication connection has been established with its Region server, the primary key data related to the RowKey can be read quickly from the Region through that Region server.
Further, after the step S702, the method further includes:
monitoring in real time the running state of the Region servers participating in the read-write task, and when the running state of one of the Region servers is down, acquiring the target Region associated with the down Region server;
and releasing the scheduling relationship between the target Region and the down Region server, and establishing a scheduling relationship between the target Region and at least one Region server whose running state is normal according to a preset scheduling rule.
In this embodiment, when a Region server fails, a scheduling relationship is established between the Region and a healthy Region server, so that data queries proceed smoothly and query efficiency is maintained.
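The rescheduling described above can be sketched as a pure function over a Region-to-server mapping. The "preset scheduling rule" is not specified in the text; the sketch assumes a least-loaded rule, and the names `reschedule`, `assignments`, and the server identifiers are hypothetical.

```python
def reschedule(assignments, down_server, healthy_servers):
    """Move every Region of down_server to the least-loaded healthy server (assumed rule)."""
    candidates = [s for s in healthy_servers if s != down_server]
    if not candidates:
        raise RuntimeError("no healthy Region server available")
    # Current number of Regions per healthy server.
    load = {s: 0 for s in candidates}
    for server in assignments.values():
        if server in load:
            load[server] += 1
    result = dict(assignments)
    for region in sorted(r for r, s in assignments.items() if s == down_server):
        target = min(candidates, key=lambda s: load[s])
        result[region] = target  # release the old relationship, establish the new one
        load[target] += 1
    return result


before = {"r1": "rs1", "r2": "rs1", "r3": "rs2"}
print(reschedule(before, "rs1", ["rs2", "rs3"]))
```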
Further, after the step S702, the method further includes:
judging whether the load degree of each Region server exceeds a preset load degree;
and if the load degree of one of the Region servers exceeds the preset load degree, stopping the running tasks associated with that Region server, marking that Region server as a shunt point, and distributing the running tasks of the shunt point to other Region servers that do not exceed the preset load degree.
Specifically, if a Region server keeps running after its load degree exceeds the preset load degree, the running tasks associated with it can be stopped and distributed to other Region servers that do not exceed the preset load degree. The tasks may be distributed to at least one Region server that has a scheduling relationship with the same Region; when no such Region server exists, they may be distributed to a Region server that has no scheduling relationship with that Region, after which a scheduling relationship between the Region and that Region server is established.
In this embodiment, when one of the Region servers exceeds the preset load degree, its running tasks are shunted to idle Region servers, so that all Region servers keep running normally.
Further, after the step S702, the method further includes:
acquiring the number of accesses of each target Region, setting the target Region as an access hotspot when its number of accesses reaches a preset count threshold, and optimizing the transmission path of the communication connection during communication with the RegionServer corresponding to the target Region.
Understandably, the transmission path optimization includes, but is not limited to, increasing the throughput of the RegionServer and increasing the bandwidth of the communication connection.
In this embodiment, when the number of accesses reaches the preset count threshold, the Region is determined to be an access hotspot, and during communication between the server and the RegionServer, the RegionServer optimizes the transmission path of the communication connection (for example, by releasing running memory on the RegionServer that is unrelated to the current transmission), which increases the transmission rate and shortens the transmission time.
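The hotspot decision itself is a simple counter compared against a threshold. The sketch below shows only that part; the class name and its interface are hypothetical, and the actual path optimization (throughput, bandwidth, memory release) is outside its scope.

```python
from collections import Counter


class HotspotTracker:
    """Counts accesses per Region and flags access hotspots."""

    def __init__(self, threshold):
        self.threshold = threshold  # the preset count threshold
        self.counts = Counter()
        self.hotspots = set()

    def record_access(self, region):
        """Record one access; return True if the Region is a hotspot."""
        self.counts[region] += 1
        if self.counts[region] >= self.threshold:
            self.hotspots.add(region)
        return region in self.hotspots
```

A caller would check the return value of `record_access` and, for a hotspot, trigger whatever optimization of the connection is available in its environment.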
In summary, the data homogenizing method described above limits the number of Regions through the preset data threshold and the total data amount of each primary key content in the data table. This prevents frequent, unbounded splitting of Regions, which would occupy a large amount of HBase cluster resources and degrade the overall read-write performance of the cluster, and thereby also improves the stability of the HBase cluster. The primary key data corresponding to each first key value in the stored data of a primary key content is distributed to the Regions for storage, so that the primary key data of each primary key content in the data table is spread approximately evenly across the Regions, which reduces the load on the RegionServers corresponding to those Regions and improves their operating speed and read-write performance.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not limit the implementation of the embodiments of the present invention in any way.
In an embodiment, a data homogenizing device is provided, which corresponds one-to-one to the data homogenizing method in the foregoing embodiments. As shown in fig. 5, the data homogenizing device includes a setting module 11, a first obtaining module 12, a first determining module 13, a second obtaining module 14, a third obtaining module 15, a second determining module 16, and an allocation module 17. The functional modules are described in detail as follows:
a setting module 11, configured to set the maximum data storage amount of a Region in the HBase cluster as a preset data threshold;
a first obtaining module 12, configured to receive a data homogenizing instruction for a data table in the HBase cluster, and obtain all primary key contents of the data table; wherein each primary key content is associated with the total data amount of one class of stored data in the data table; the stored data corresponding to a primary key content comprises at least one piece of primary key data;
a first determining module 13, configured to determine, according to the preset data threshold and the total data amount corresponding to each primary key content, the number of regions corresponding to each primary key content in the data table;
a second obtaining module 14, configured to obtain a RowKey data set corresponding to each of the primary key contents in the data table by using a preset Hash algorithm; the RowKey data set comprises the RowKey of each piece of primary key data corresponding to the primary key content;
a third obtaining module 15, configured to extract a first key value from each RowKey in the RowKey data set corresponding to the primary key content, sort the pieces of primary key data corresponding to the primary key content according to the first key value, and obtain the sequence values after sorting;
A second determining module 16, configured to determine, from all primary key data corresponding to the primary key content, two boundary elements of each Region corresponding to the primary key content according to the number of regions and the sequence value, and determine a front boundary value and a rear boundary value of each Region through the two boundary elements of each Region corresponding to the primary key content; the front boundary value and the rear boundary value are both first key values corresponding to the boundary elements; the range between the front boundary value and the rear boundary value of each Region is recorded as the boundary range of the Region;
and an allocation module 17, configured to obtain the first key value of the primary key data corresponding to the primary key content, and, when the obtained first key value is confirmed to be within the boundary range of a Region corresponding to the primary key content, allocate the primary key data corresponding to that first key value to that Region for storage.
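The work of modules 13, 16 and 17 can be sketched as follows. This is a rough illustration under assumptions: this passage does not give the formula for the number of Regions, so a ceiling division of the total data amount by the threshold is assumed; first key values are treated as any sortable values; and all function names are hypothetical.

```python
import math


def region_count(total_amount, threshold):
    """Assumed rule: enough Regions that none exceeds the threshold."""
    return max(1, math.ceil(total_amount / threshold))


def region_boundaries(first_keys, n_regions):
    """Return one (front, rear) boundary pair per Region.

    The first key values are sorted, cut into n_regions contiguous
    chunks of near-equal size, and the first and last element of each
    chunk serve as that Region's boundary elements.
    """
    ordered = sorted(first_keys)
    if not ordered:
        raise ValueError("no first key values to partition")
    n_regions = min(n_regions, len(ordered))
    base, extra = divmod(len(ordered), n_regions)
    bounds, start = [], 0
    for i in range(n_regions):
        size = base + (1 if i < extra else 0)
        bounds.append((ordered[start], ordered[start + size - 1]))
        start += size
    return bounds


def assign_region(first_key, bounds):
    """Index of the first Region whose boundary range contains the key."""
    for idx, (front, rear) in enumerate(bounds):
        if front <= first_key <= rear:
            return idx
    raise ValueError("first key value is outside every boundary range")
```

With six first key values `[5, 1, 9, 3, 7, 2]` and three Regions, the boundary ranges come out as `(1, 2)`, `(3, 5)` and `(7, 9)`, so each Region receives two keys.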
Further, the second acquisition module includes:
an extraction sub-module, configured to apply the Hash algorithm to each piece of primary key data in the class of stored data corresponding to each primary key content in the data table to obtain the HashKey of each piece of primary key data, and to extract the second key value corresponding to each HashKey;
an operation sub-module, configured to acquire the character length of each piece of primary key data in each primary key content, and to operate on each second key value and the corresponding character length using an absolute value algorithm to obtain each third key value;
a splicing sub-module, configured to acquire the characters corresponding to the third key values from a preset character distribution table, and to splice the acquired characters with the corresponding primary key data through a preset symbol to obtain the RowKey data set; each RowKey in the RowKey data set is formed by splicing at least one character with one piece of primary key data in one primary key content through the preset symbol.
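The three sub-modules can be illustrated with the sketch below. Several details are assumptions made only for the example: MD5 stands in for the unspecified Hash algorithm, the second key value is taken as a signed integer from the first four digest bytes, the "absolute value algorithm" is modelled as the absolute value of the sum of the second key value and the character length, reduced modulo the table size, and the character table and separator are hypothetical.

```python
import hashlib

CHAR_TABLE = "0123456789abcdef"  # hypothetical preset character distribution table
SEPARATOR = "_"                  # hypothetical preset symbol


def make_rowkey(primary_key):
    """Build a RowKey of the form <character><symbol><primary key data>."""
    digest = hashlib.md5(primary_key.encode("utf-8")).digest()
    # Second key value: a signed integer derived from the HashKey (assumed).
    second = int.from_bytes(digest[:4], "big", signed=True)
    # Third key value: absolute value combined with the character length,
    # mapped onto an index of the character table (assumed operation).
    third = abs(second + len(primary_key)) % len(CHAR_TABLE)
    return CHAR_TABLE[third] + SEPARATOR + primary_key


def first_key_value(rowkey):
    """Extract the leading character(s) used to sort and place the row."""
    return rowkey.split(SEPARATOR, 1)[0]
```

Because the prefix depends only on the primary key data, the same key always maps to the same RowKey, while different keys are scattered over the characters of the table.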
Further, the data uniformizing device further includes:
a third determining module, configured to obtain a RowKey to be retrieved in the RowKey data set corresponding to the primary key content, and determine, according to the first key value corresponding to the RowKey, a target Region in the HBase cluster where primary key data corresponding to the RowKey is stored;
and the writing module is used for positioning at least one Region server with a scheduling relation with the determined target Region, reading the primary key data corresponding to the RowKey through the Region server after communication connection with the Region server is established, and writing the primary key data read by the Region server into a local preset data table through communication connection.
Further, the writing module further includes:
the third acquisition module is used for monitoring the running state of the Region servers participating in the read-write task in real time, and acquiring the target Region associated with the Region server which is down when the running state of one of the Region servers is down;
the establishing module is used for making the target Region release its scheduling relationship with the down RegionServer, and for establishing, according to a preset scheduling rule, a scheduling relationship between the target Region and at least one RegionServer whose running state is normal.
Further, the writing module further includes:
the judging module is used for judging whether the load degree of each RegionServer exceeds a preset load degree;
and the distribution module is used for, if the load degree of one of the RegionServers exceeds the preset load degree, stopping the operation tasks associated with that RegionServer, marking that RegionServer as a shunt point, and distributing the operation tasks corresponding to the shunt point to other RegionServers that do not exceed the preset load degree.
Further, the writing module further includes:
the optimization processing module is used for acquiring the number of accesses of each target Region, setting the target Region as an access hotspot when its number of accesses reaches a preset count threshold, and optimizing the transmission path of the communication connection during communication with the RegionServer corresponding to the target Region.
For specific limitations of the data homogenizing device, reference may be made to the limitations of the data homogenizing method above, which are not repeated here. Each module in the data homogenizing device may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 6. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing data involved in the data homogenization method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a data homogenization method.
In one embodiment, a computer device is provided that includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the computer program to implement the steps of the data homogenization method of the above embodiment, such as steps S10 through S70 shown in fig. 2. Alternatively, the processor, when executing the computer program, implements the functions of the modules/units of the data homogenizing apparatus in the above embodiments, such as the functions of the modules 11 to 17 shown in fig. 5. In order to avoid repetition, a description thereof is omitted.
In one embodiment, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the steps of the data homogenization method in the above embodiment, such as steps S10 to S70 shown in fig. 2. Alternatively, the computer program when executed by the processor implements the functions of the respective modules/units of the data uniformizing device in the above-described embodiments, such as the functions of the modules 11 to 17 shown in fig. 5. In order to avoid repetition, a description thereof is omitted.
Those skilled in the art will appreciate that all or part of the methods described above may be implemented by a computer program stored on a non-transitory computer readable storage medium, and that the program, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention.

Claims (10)

1. A method of data homogenization, comprising:
setting the maximum data storage amount of Region in the HBase cluster as a preset data threshold;
receiving a data homogenizing instruction for a data table in the HBase cluster, and acquiring all primary key contents of the data table; wherein each primary key content is associated with the total data amount of one class of stored data in the data table; the stored data corresponding to a primary key content comprises at least one piece of primary key data;
Determining the number of regions corresponding to each primary key content in the data table according to the preset data threshold and the total data amount corresponding to each primary key content;
acquiring a RowKey data set corresponding to each primary key content in the data table by using a preset Hash algorithm; the RowKey data set comprises RowKey of each piece of main key data corresponding to the main key content;
after extracting a first key value from each RowKey in the RowKey data set corresponding to the main key content, sorting each main key data corresponding to the main key content according to the first key value, and obtaining a sequence value after sorting;
determining two boundary elements of each Region corresponding to the main key content from all main key data corresponding to the main key content according to the number of regions corresponding to the main key content and the sequence value, and determining a front boundary value and a rear boundary value of each Region through the two boundary elements of each Region corresponding to the main key content; the front boundary value and the rear boundary value are both first key values corresponding to the boundary elements; the range between the front boundary value and the rear boundary value of each Region is recorded as the boundary range of the Region;
acquiring the first key value of the primary key data corresponding to the primary key content, and, when the acquired first key value is confirmed to be within the boundary range of a Region corresponding to the primary key content, distributing the primary key data corresponding to that first key value to that Region for storage.
2. The method for homogenizing data according to claim 1, wherein the acquiring, by using a preset Hash algorithm, a RowKey dataset corresponding to each of the primary key contents in the data table includes:
hash is applied to each primary key data in the class of storage data corresponding to each primary key content in the data table, so that the HashKey of each primary key data in each primary key content is obtained, and each second key value corresponding to each HashKey is extracted;
acquiring the character length of each primary key data in each primary key content, and respectively calculating each second key value and each character length by using an absolute value algorithm to obtain each third key value;
acquiring characters corresponding to the third key values from a preset character distribution table, and correspondingly splicing the acquired characters with corresponding primary key data through preset symbols to obtain the RowKey data set; each RowKey in the RowKey data set is formed by splicing at least one character with one primary key data in one primary key content through a preset symbol.
3. The method for homogenizing data according to claim 1, wherein after the primary key data corresponding to the first key value in the primary key content is assigned to the Region for storage, the method further comprises:
acquiring a RowKey to be retrieved in the RowKey data set corresponding to the main key content, and determining a target Region stored in the HBase cluster by the main key data corresponding to the RowKey according to the first key value corresponding to the RowKey;
and positioning at least one Region server with a scheduling relation with the determined target Region, after establishing communication connection with the Region server, reading main key data corresponding to the RowKey through the Region server, and writing the main key data read by the Region server into a local preset data table through communication connection.
4. The data homogenizing method of claim 3, wherein after locating at least one Region server having a scheduling relationship with the determined target Region, further comprising:
monitoring in real time the running state of the RegionServers participating in the read-write task, and, when the running state of one of the RegionServers is down, obtaining the target Region associated with the down RegionServer;
making the target Region release its scheduling relationship with the down RegionServer, and establishing, according to a preset scheduling rule, a scheduling relationship between the target Region and at least one RegionServer whose running state is normal.
5. The data homogenizing method of claim 3, wherein after the data read by the region server is written into a preset data table in the local area through a communication connection, the method further comprises:
judging whether the load degree of each RegionServer exceeds a preset load degree;
and if the load degree of one of the RegionServers exceeds the preset load degree, stopping the operation tasks associated with that RegionServer, marking that RegionServer as a shunt point, and distributing the operation tasks corresponding to the shunt point to other RegionServers that do not exceed the preset load degree.
6. The data homogenizing method of claim 3, wherein after the data read by the region server is written into a preset data table in the local area through a communication connection, the method further comprises:
acquiring the number of accesses of each target Region, setting the target Region as an access hotspot when its number of accesses reaches a preset count threshold, and optimizing the transmission path of the communication connection during communication with the RegionServer corresponding to the target Region.
7. A data homogenizing apparatus, comprising:
the setting module is used for setting the maximum data storage amount of Region in the HBase cluster as a preset data threshold;
the first acquisition module is used for receiving a data homogenization instruction of a data table in the HBase cluster and acquiring all primary key contents of the data table; wherein one of the primary key content is associated with a total data amount of one of the types of stored data of the data table; the storage data corresponding to the primary key content comprises at least one primary key data;
the first determining module is used for determining the number of regions corresponding to each main key content in the data table according to the preset data threshold and the total data amount corresponding to each main key content;
the second acquisition module is used for acquiring a RowKey data set corresponding to each primary key content in the data table by using a preset Hash algorithm; the RowKey data set comprises the RowKey of each piece of primary key data corresponding to the primary key content;
The third acquisition module is used for extracting a first key value from each RowKey in the RowKey data set corresponding to the main key content, sorting each main key data corresponding to the main key content according to the first key value, and acquiring a sequence value after sorting;
the second determining module is used for determining two boundary elements of each Region corresponding to the main key content from all main key data corresponding to the main key content according to the number of regions corresponding to the main key content and the sequence value, and determining a front boundary value and a rear boundary value of each Region through the two boundary elements of each Region corresponding to the main key content; the front boundary value and the rear boundary value are both first key values corresponding to the boundary elements; the range between the front boundary value and the rear boundary value of each Region is recorded as the boundary range of the Region;
and the distribution module is used for acquiring the first key value of the main key data corresponding to the main key content, and distributing the main key data corresponding to the first key value in the main key content to a Region corresponding to the main key content for storage when the acquired first key value is confirmed to be within the boundary range of the Region.
8. The data homogenizing apparatus of claim 7, wherein the second acquisition module comprises:
the extraction submodule is used for obtaining the HashKey of each main key data in each main key content by applying Hash to each main key data in the class of storage data corresponding to each main key content in the data table, and extracting each second key value corresponding to each HashKey;
the operation sub-module is used for acquiring the character length of each primary key data in each primary key content, and respectively operating each second key value and each character length by using an absolute value algorithm to obtain each third key value;
the splicing sub-module is used for acquiring the characters corresponding to the third key values from a preset character distribution table, and splicing the acquired characters and the corresponding primary key data through preset symbols to obtain the RowKey data set; each RowKey in the RowKey data set is formed by splicing at least one character with one primary key data in one primary key content through a preset symbol.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the data homogenization method according to any one of claims 1 to 6 when executing the computer program.
10. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the data homogenization method according to any one of claims 1 to 6.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010066910.5A CN111259012B (en) 2020-01-20 2020-01-20 Data homogenizing method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010066910.5A CN111259012B (en) 2020-01-20 2020-01-20 Data homogenizing method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111259012A CN111259012A (en) 2020-06-09
CN111259012B true CN111259012B (en) 2024-03-12

Family

ID=70952479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010066910.5A Active CN111259012B (en) 2020-01-20 2020-01-20 Data homogenizing method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111259012B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112181831A (en) * 2020-09-28 2021-01-05 中国平安财产保险股份有限公司 Script performance verification method, device and equipment based on keywords and storage medium
CN116991910A (en) * 2022-04-26 2023-11-03 华为技术有限公司 Control method and device of data processing device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103399945A (en) * 2013-08-15 2013-11-20 成都博云科技有限公司 Data structure based on cloud computing database system
CN104794123A (en) * 2014-01-20 2015-07-22 阿里巴巴集团控股有限公司 Method and device for establishing NoSQL database index for semi-structured data
CN105681414A (en) * 2016-01-14 2016-06-15 深圳市博瑞得科技有限公司 Method and system for avoiding data hotspot of Hbase
CN106528819A (en) * 2016-11-16 2017-03-22 北京集奥聚合科技有限公司 Method and system for reading and writing time series data by HBase
CN107273482A (en) * 2017-06-12 2017-10-20 北京市天元网络技术股份有限公司 Alarm data storage method and device based on HBase
CN110362549A (en) * 2019-06-17 2019-10-22 平安普惠企业管理有限公司 Log memory search method, electronic device and computer equipment

Also Published As

Publication number Publication date
CN111259012A (en) 2020-06-09

Similar Documents

Publication Publication Date Title
WO2020186786A1 (en) File processing method and apparatus, computer device and storage medium
CN108255958B (en) Data query method, device and storage medium
CN109787908B (en) Server current limiting method, system, computer equipment and storage medium
CN107515878B (en) Data index management method and device
CN106407207B (en) Real-time newly-added data updating method and device
CN111797096A (en) Data indexing method and device based on ElasticSearch, computer equipment and storage medium
CN111259012B (en) Data homogenizing method, device, computer equipment and storage medium
CN111813805A (en) Data processing method and device
CN111191079B (en) Document content acquisition method, device, equipment and storage medium
CN111143331B (en) Data migration method, device and computer storage medium
CN103942292A (en) Virtual machine mirror image document processing method, device and system
CN113419824A (en) Data processing method, device, system and computer storage medium
CA3128540C (en) Cache system hotspot data access method, apparatus, computer device and storage medium
CN112148693A (en) Data processing method, device and storage medium
CN109325026B (en) Data processing method, device, equipment and medium based on big data platform
CN112559529A (en) Data storage method and device, computer equipment and storage medium
CN111885184A (en) Method and device for processing hot spot access keywords in high concurrency scene
CN111949681A (en) Data aggregation processing device and method and storage medium
CN110888972A (en) Sensitive content identification method and device based on Spark Streaming
CN111274291B (en) Query method, device, equipment and medium for user access data
CN114253456A (en) Cache load balancing method and device
CN112015718A (en) HBase cluster balancing method and device, electronic equipment and storage medium
CN110609707B (en) Online data processing system generation method, device and equipment
CN115033551A (en) Database migration method and device, electronic equipment and storage medium
CN114461606A (en) Data storage method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant