CN111259012A - Data homogenizing method and device, computer equipment and storage medium - Google Patents

Info

Publication number
CN111259012A
Authority
CN
China
Prior art keywords
data
primary key
region
value
rowkey
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010066910.5A
Other languages
Chinese (zh)
Other versions
CN111259012B (en)
Inventor
郑金伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN202010066910.5A
Publication of CN111259012A
Application granted
Publication of CN111259012B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474 Sequence data queries, e.g. querying versioned data
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data homogenizing method and device, computer equipment and a storage medium. The method comprises: acquiring all primary key contents of a data table; determining the number of Regions corresponding to each primary key content according to the maximum data storage capacity of a Region and the total data volume corresponding to each primary key content; acquiring a RowKey data set corresponding to each primary key content; after extracting a first key value from each RowKey in the RowKey data set, sorting the primary key data corresponding to the primary key content by first key value to obtain sequence values; determining the boundary range of each Region corresponding to the primary key content from all the primary key data according to the number of Regions corresponding to the primary key content and the sequence values; and acquiring the first key value of each piece of primary key data in the primary key content and, when the first key value is confirmed to fall within the boundary range of a Region, allocating the corresponding primary key data to that Region for storage. The invention distributes the data of the data table evenly across the Regions.

Description

Data homogenizing method and device, computer equipment and storage medium
Technical Field
The present invention relates to the field of data storage, and in particular, to a data homogenizing method, apparatus, computer device, and storage medium.
Background
The Region is the smallest unit of distribution in an HBase cluster, and HBase distributes the Regions owned by each service table across the RegionServers. A service data table usually starts with a single Region, but as the table's data volume grows and reaches a threshold, the Region is split repeatedly into multiple Regions. If the data volume of the service data table is large, its data ends up unevenly distributed across the Regions, which puts pressure on the RegionServers associated with those Regions and degrades the read-write performance of HBase. Frequent Region splitting also consumes a large share of the resources of the servers hosting the HBase cluster, which further degrades the read-write performance of HBase. A technical solution to these problems is therefore needed.
Disclosure of Invention
In view of the foregoing, it is necessary to provide a data homogenizing method, device, computer equipment and storage medium that distribute the data of a data table evenly across the Regions, so as to improve the read-write performance of the RegionServers corresponding to those Regions.
A method of data homogenization, comprising:
setting the maximum data storage capacity of a Region in the HBase cluster as a preset data threshold;
receiving a data homogenization instruction for a data table in the HBase cluster, and acquiring all primary key contents of the data table; wherein each primary key content is associated with the total data volume of one type of stored data in the data table; the stored data corresponding to a primary key content comprises at least one piece of primary key data;
determining the number of Regions corresponding to each primary key content in the data table according to the preset data threshold and the total data volume corresponding to each primary key content;
acquiring a RowKey data set corresponding to each primary key content in the data table by using a preset Hash algorithm; the RowKey data set comprises the RowKey of each piece of primary key data corresponding to the primary key content;
after extracting a first key value from each RowKey in the RowKey data set corresponding to the primary key content, sorting the primary key data corresponding to the primary key content by first key value, and acquiring the sequence values produced by the sorting;
determining two boundary elements of each Region corresponding to the primary key content from all primary key data corresponding to the primary key content according to the number of Regions corresponding to the primary key content and the sequence values, and determining a front boundary value and a rear boundary value of each Region from its two boundary elements; the front boundary value and the rear boundary value are the first key values corresponding to the boundary elements; the range between the front boundary value and the rear boundary value of each Region is recorded as the boundary range of that Region;
and acquiring the first key value of each piece of primary key data corresponding to the primary key content and, when the acquired first key value is confirmed to fall within the boundary range of a Region corresponding to the primary key content, allocating the primary key data corresponding to that first key value to that Region for storage.
A data homogenizing apparatus, comprising:
the setting module is used for setting the maximum data storage capacity of a Region in the HBase cluster as a preset data threshold;
the first acquisition module is used for receiving a data homogenization instruction of a data table in the HBase cluster and acquiring all primary key contents of the data table; wherein one of the primary key contents is associated with a total data amount of one of the types of stored data of the data table; the storage data corresponding to the primary key content comprises at least one primary key data;
a first determining module, configured to determine, according to the preset data threshold and the total data amount corresponding to each primary key content, the number of regions corresponding to each primary key content in the data table;
the second acquisition module is used for acquiring a RowKey data set corresponding to each primary key content in the data table by using a preset Hash algorithm; the RowKey data set comprises RowKey of each piece of primary key data corresponding to the primary key content;
a third obtaining module, configured to, after extracting a first key value from each RowKey in the RowKey data set corresponding to the primary key content, sort the primary key data corresponding to the primary key content by first key value, and obtain the sequence values produced by the sorting;
a second determining module, configured to determine, according to the number of regions corresponding to the primary key content and the sequence value, two boundary elements of each Region corresponding to the primary key content from all primary key data corresponding to the primary key content, and determine a front boundary value and a rear boundary value of each Region corresponding to the primary key content through the two boundary elements of the Region; the front boundary value and the rear boundary value are first key values corresponding to the boundary elements; recording the range between the front boundary value and the rear boundary value of each Region as the boundary range of the Region;
and the distribution module is used for acquiring the first key value of the primary key data corresponding to the primary key content, and distributing the primary key data corresponding to the first key value in the primary key content to a Region for storage when the size of the acquired first key value is confirmed to be within the boundary range of the Region corresponding to the primary key content.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the above data homogenizing method when executing the computer program.
A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the above-mentioned data homogenizing method.
With the data homogenizing method, device, computer equipment and storage medium described above, the number of Regions is bounded by the preset data threshold and the total data volume of each primary key content in the data table. This prevents Regions from being split frequently and without limit, which would otherwise occupy a large share of the HBase cluster's resources and degrade its overall read-write performance, so the stability of the HBase cluster improves accordingly. The primary key data corresponding to each first key value in the stored data of a primary key content is allocated to the Regions for storage, so that the primary key data of each primary key content in the data table is spread roughly evenly across the Regions. This reduces the operating load on the RegionServers corresponding to the Regions, increases their operating speed, and improves their read-write performance.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a diagram illustrating an application environment of a data homogenizing method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a data uniformization method according to an embodiment of the present invention;
FIG. 3 is a flowchart of step S40 of the data homogenizing method in an application environment according to an embodiment of the present invention;
FIG. 4 is a flowchart of the steps following step S70 of the data homogenizing method in an application environment according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a data uniformization apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a computer device according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The data homogenizing method provided by the invention can be applied in the application environment shown in FIG. 1, in which a client communicates with a server through a network. The client may be, but is not limited to, a personal computer, laptop, smartphone, tablet or portable wearable device. The server may be implemented as a stand-alone server or as a server cluster consisting of a plurality of servers.
In an embodiment, as shown in fig. 2, a data uniformizing method is provided, which is described by taking the server in fig. 1 as an example, and includes the following steps:
S10, setting the maximum data storage capacity of a Region in the HBase cluster as a preset data threshold;
Understandably, HBase (Hadoop Database) is a highly reliable, high-performance, column-oriented and scalable distributed storage system, and HBase technology can be used to build a large-scale structured storage cluster (an HBase cluster) on servers. The preset data threshold represents the data storage capacity of a Region (a data partition) and may be set as required, for example to 256M; the server may also automatically set the maximum data storage capacity of the Region as the preset data threshold.
S20, receiving a data homogenization instruction for a data table in the HBase cluster, and acquiring all primary key contents of the data table; wherein one of the primary key contents is associated with a total data amount of one of the types of stored data of the data table; the storage data corresponding to the primary key content comprises at least one primary key data;
Understandably, the data table includes a plurality of primary key contents, and each primary key content is associated with the total data volume of one type of stored data (the total data volume that each primary key content should occupy is preset in the data table). For example, the primary key contents may be name, telephone number, identity card number and the like, so the primary key data in the stored data associated with those primary key contents are specific names, specific telephone numbers and specific identity card numbers, and the data table specifies how much data capacity (total data volume) these specific names, telephone numbers and identity card numbers should occupy.
S30, determining the number of regions corresponding to each primary key content in the data table according to the preset data threshold and the total data volume corresponding to each primary key content;
Understandably, the total data volume of a primary key content is used to determine the number of Regions corresponding to that primary key content. For example, if the total data volume of a primary key content is 2560M and the preset data threshold is 256M, the number of Regions is determined to be 10; if the total data volume is 500M and the preset data threshold is 256M, the number of Regions is determined to be 2 (when the total data volume is not exactly divisible by the preset data threshold, the remainder may be rounded so that the number of Regions is an integer). A primary key content corresponds to at least one Region, and the larger its total data volume, the more Regions it corresponds to.
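The Region count of step S30 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name region_count is hypothetical, sizes are taken in megabytes, and rounding any remainder up to a whole Region is an assumption (the text only says the remainder may be rounded). Rounding up agrees with both worked examples in the text.

```python
def region_count(total_mb: int, threshold_mb: int = 256) -> int:
    """Number of Regions for one primary key content (hypothetical helper).

    Assumption: a non-zero remainder gets its own Region (ceiling division),
    and at least one Region is always allocated.
    """
    if threshold_mb <= 0:
        raise ValueError("threshold_mb must be positive")
    if total_mb < 0:
        raise ValueError("total_mb must not be negative")
    # Integer ceiling division avoids floating-point rounding.
    return max(1, -(-total_mb // threshold_mb))


if __name__ == "__main__":
    print(region_count(2560))  # 2560M of data with a 256M threshold
    print(region_count(500))   # 500M of data with a 256M threshold
```

Because the result is rounded up, the last Region may hold less data than the others, which matches the remark in the text about the final Region.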
S40, acquiring a RowKey data set corresponding to each primary key content in the data table by using a preset Hash algorithm; the RowKey data set comprises RowKey of each piece of primary key data corresponding to the primary key content;
Understandably, hashing converts an input of arbitrary length (also called a pre-image) into an output of fixed length by means of a Hash algorithm (the preset Hash algorithm is detailed in steps S401 to S403). A RowKey can be calculated for each piece of primary key data of each primary key content through the preset Hash algorithm, and once RowKeys have been calculated for a plurality of pieces of primary key data, they form a RowKey data set.
S50, after extracting a first key value from each RowKey in the RowKey data set corresponding to the primary key content, sorting the primary key data corresponding to the primary key content by first key value, and acquiring the sequence values produced by the sorting;
Understandably, for each piece of primary key data in the stored data corresponding to a primary key content, a first key value can be obtained from the RowKey calculated by the preset Hash algorithm, so each RowKey is associated with one first key value. When the first key value contains both digits and letters, it can be sorted in dictionary order, or by its letters only, or by its digits only; when it contains only digits, it has a numerical value, and the primary key data associated with each RowKey can be sorted by that value. The sequence value is the mark assigned to each first key value after sorting (for example, after sorting numerically, the sequence values may be 1, 2, 3 and so on).
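The sorting in step S50 can be sketched as follows. The text does not say exactly which part of a RowKey forms the first key value, so this sketch assumes it is the prefix in front of the preset symbol (taken here to be "_"); the function names are hypothetical. Sorting uses Python's default string order, which is one possible reading of "dictionary sorting".

```python
def first_key_value(rowkey: str, sep: str = "_") -> str:
    """Assumed first key value: the part of the RowKey before the separator."""
    return rowkey.split(sep, 1)[0]


def rank_by_first_key(rowkeys):
    """Sort RowKeys by first key value and attach sequence values 1, 2, 3, ...

    Returns a list of (sequence_value, first_key_value, rowkey) tuples.
    """
    ordered = sorted(rowkeys, key=first_key_value)
    return [(seq, first_key_value(rk), rk)
            for seq, rk in enumerate(ordered, start=1)]


if __name__ == "__main__":
    for row in rank_by_first_key(["b3_x", "a8_13700000000", "a1_y"]):
        print(row)
```

Python compares strings by code point, so uppercase letters sort before lowercase ones; a different collation would need a different sort key.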
S60, determining two boundary elements of each Region corresponding to the primary key content from all primary key data corresponding to the primary key content according to the number of the regions corresponding to the primary key content and the sequence value, and determining a front boundary value and a rear boundary value of each Region corresponding to the primary key content through the two boundary elements of the Region; the front boundary value and the rear boundary value are first key values corresponding to the boundary elements; recording the range between the front boundary value and the rear boundary value of each Region as the boundary range of the Region;
Understandably, each Region has a boundary range of the same size, and each Region has two boundaries, a front boundary and a rear boundary, whose values differ from Region to Region. A boundary element is the piece of primary key data that corresponds to a boundary's first key value. Specifically, all the Regions of a primary key content together hold all of its primary key data, and the two boundary elements of each Region can be determined from the sequence values of the first key values of the RowKeys of that primary key data; the front boundary value and the rear boundary value of each Region are then given by its boundary elements. For example, suppose one primary key content has 100 pieces of primary key data in total, the first key values of their RowKeys have been sorted by sequence value, and the number of Regions has been determined to be 10. Distributing the data equally among the Regions, each Region corresponds to 10 pieces of primary key data and therefore to 10 first key values that are already in sorted order. Each Region can thus determine, among its 10 allocated first key values, the boundary element with the smallest first key value and the boundary element with the largest first key value; that is, the first key values of the elements with the smallest and the largest sequence value are taken as the front boundary value and the rear boundary value of the Region. It should be noted that, because the number of Regions is determined by the preset data threshold and the total data volume corresponding to the primary key contents, when the total data volume is not exactly divisible by the preset data threshold the last Region cannot hold the same amount of data as each of the preceding Regions. The difference between the last Region and the preceding Regions is small, however, so the load balance of the HBase cluster is not affected.
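The boundary determination of step S60 can be sketched as follows, assuming the first key values are already sorted and that each Region receives a contiguous run of equal size, with any shortfall landing in the last Region. The function name region_boundaries is hypothetical and the chunking rule is an assumption; the text only gives the 100-items-in-10-Regions example.

```python
def region_boundaries(sorted_keys, n_regions):
    """Split sorted first key values into contiguous chunks, one per Region.

    Returns a list of (front_boundary_value, rear_boundary_value) pairs.
    Assumption: chunk size is ceil(len / n_regions), so the last chunk may be
    smaller, and fewer than n_regions chunks result when keys are scarce.
    """
    if n_regions <= 0:
        raise ValueError("n_regions must be positive")
    if not sorted_keys:
        raise ValueError("sorted_keys must not be empty")
    size = -(-len(sorted_keys) // n_regions)  # ceiling division
    chunks = [sorted_keys[i:i + size] for i in range(0, len(sorted_keys), size)]
    # The smallest and largest element of each chunk are its boundary elements.
    return [(chunk[0], chunk[-1]) for chunk in chunks]


if __name__ == "__main__":
    keys = list(range(1, 101))          # 100 sorted first key values
    print(region_boundaries(keys, 10))  # ten (front, rear) pairs
```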
S70, acquiring the first key value of each piece of primary key data corresponding to the primary key content and, when the acquired first key value is confirmed to fall within the boundary range of a Region corresponding to the primary key content, allocating the primary key data corresponding to that first key value to that Region for storage.
Understandably, each primary key content corresponds to a RowKey data set, the RowKey data set consists of RowKeys, and each RowKey is associated with a first key value. The boundary range determined for a Region consists of a front boundary value and a rear boundary value, which are themselves key values corresponding to RowKeys, so a first key value can be compared with them. When the first key value is larger than the front boundary value and smaller than the rear boundary value, the primary key data corresponding to that first key value is allocated to the Region with that boundary range for storage.
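The allocation check of step S70 can be sketched as a range lookup. The function name find_region is hypothetical, and the boundaries are assumed to be sorted, non-overlapping (front, rear) pairs. The comparison here is inclusive, which is an assumption: the boundary elements are themselves data of the Region, so a strictly exclusive test would leave them unassigned.

```python
import bisect


def find_region(first_key, boundaries):
    """Return the index of the Region whose range contains first_key, else None.

    boundaries: sorted list of (front, rear) pairs, one per Region.
    Assumption: both boundary values belong to the Region (inclusive range).
    """
    fronts = [front for front, _ in boundaries]
    # Rightmost Region whose front boundary value is <= first_key.
    i = bisect.bisect_right(fronts, first_key) - 1
    if i >= 0 and boundaries[i][0] <= first_key <= boundaries[i][1]:
        return i
    return None


if __name__ == "__main__":
    bounds = [(1, 10), (11, 20), (21, 30)]
    print(find_region(15, bounds))  # index of the Region that stores key 15
```

The binary search keeps the lookup logarithmic in the number of Regions; a linear scan over the boundary ranges would give the same answer.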
In the embodiment of steps S10 to S70, the number of Regions is bounded by the preset data threshold and the total data volume of each primary key content in the data table. This prevents Regions from being split frequently and without limit, which would otherwise occupy a large share of the HBase cluster's resources and degrade its overall read-write performance, so the stability of the HBase cluster improves accordingly. The primary key data corresponding to each first key value in the stored data of a primary key content is allocated to the Regions for storage, so that the primary key data of each primary key content in the data table is spread roughly evenly across the Regions. This reduces the operating load on the RegionServer corresponding to each Region (the RegionServer is the main component of HBase and is responsible for the actual reading and writing of data and for managing Regions), increases its operating speed, and improves its read-write performance.
Further, as shown in fig. 3, the obtaining, by using a preset Hash algorithm, a RowKey data set corresponding to each primary key content in the data table includes:
S401, hashing each piece of primary key data in the stored data corresponding to each primary key content in the data table to obtain its HashKey, and extracting the second key value corresponding to each HashKey;
S402, acquiring the character lengths for each piece of primary key data in each primary key content, and applying an absolute value algorithm to each second key value and each character length to obtain each third key value;
S403, acquiring the characters corresponding to each third key value from a preset character distribution table, and splicing the acquired characters with the corresponding primary key data through a preset symbol to obtain the RowKey data set; each RowKey in the RowKey data set is formed by splicing at least one character and one piece of primary key data of a primary key content through the preset symbol.
Understandably, WORDS in the preset character distribution table is limited to 52 characters (the 26 English letters arranged twice, once per letter case, one run after the other), and NUMBERS in the preset character distribution table is limited to the digits 0 to 9, so that every resulting RowKey falls within a preset distribution range, which makes all the RowKeys easier to query and manage later.
Specifically, a HashKey is obtained from each piece of primary key data in each primary key content by hashing; the Hash function used can be selected as required. For example, if one piece of primary key data is the telephone number 13700000000 and the Hash algorithm yields the HashKey -1088245419, then -1088245419 is the second key value. Next, the character lengths are obtained, namely the lengths of WORDS and NUMBERS. Each second key value is divided by each character length and the absolute value of the remainder is taken (the preset absolute value algorithm), which gives the third key value: with WORDS and NUMBERS of lengths (52, 10), the HashKey above is divided by 52 and by 10 respectively, the absolute values of the remainders are taken, and the third key value is (27, 9). Finally, the third key value (that is, the hash value specialized for WORDS and NUMBERS) is used to look up the corresponding characters in the preset character distribution table: for (27, 9), the character at position 27 is taken from WORDS and the character at position 9 from NUMBERS. The extracted characters are spliced with the primary key data through the preset symbol to obtain the RowKey, and the calculated RowKeys together form the RowKey data set. For example, assuming the character at position 27 is a, the character at position 9 is 8 and the primary key data is 13700000000, the RowKey obtained after splicing is a8_13700000000.
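Steps S401 to S403 can be sketched as follows. The text leaves the Hash function open; the example value -1088245419 happens to equal what a 31-multiplier string hash wrapped to a signed 32-bit integer (the scheme used by Java's String.hashCode) returns for 13700000000, so that function is used here as an assumption. The order of WORDS (uppercase run first) and the 1-based positions are also assumptions, chosen because they reproduce the RowKey a8_13700000000 from the example. All function names are hypothetical.

```python
WORDS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"  # assumed order
NUMBERS = "0123456789"


def java_style_hash(s: str) -> int:
    """31-multiplier string hash wrapped to a signed 32-bit integer (assumed)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def third_key_value(second_key: int):
    """Absolute value of the remainder for each table length (step S402)."""
    return abs(second_key) % len(WORDS), abs(second_key) % len(NUMBERS)


def make_rowkey(primary_key_data: str, sep: str = "_") -> str:
    """Splice the looked-up characters and the primary key data (step S403)."""
    w, n = third_key_value(java_style_hash(primary_key_data))
    # Assumption: positions are 1-based; a remainder of 0 wraps around to the
    # last character via Python's negative indexing.
    return f"{WORDS[w - 1]}{NUMBERS[n - 1]}{sep}{primary_key_data}"


if __name__ == "__main__":
    print(java_style_hash("13700000000"))
    print(third_key_value(-1088245419))
    print(make_rowkey("13700000000"))
```

Taking abs() before the modulo gives the same result as taking the absolute value of a sign-preserving remainder, which is how the (27, 9) in the text is obtained; Python's own % on a negative number would return a different value.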
Further, as shown in fig. 4, after the step S70, the method further includes:
S701, acquiring a RowKey to be retrieved from the RowKey data set corresponding to the primary key content, and determining, according to the first key value corresponding to the RowKey, the target Region in the HBase cluster that stores the primary key data corresponding to the RowKey;
S702, locating at least one RegionServer that has a scheduling relationship with the determined target Region, establishing a communication connection with the RegionServer, reading the primary key data corresponding to the RowKey through the RegionServer, and writing the primary key data read by the RegionServer into a local preset data table through the communication connection.
In this embodiment, once the target Region has been found from the RowKey, a communication connection is established with the corresponding RegionServer, and that RegionServer can quickly read the primary key data associated with the RowKey from the Region.
Further, after the step S702, the method further includes:
monitoring in real time the operating states of the RegionServers participating in read-write tasks and, when the operating state of a RegionServer is down, acquiring the target Regions associated with the RegionServer that is down;
and releasing the scheduling relationship between each such target Region and the RegionServer that is down, and establishing a scheduling relationship between the target Region and at least one RegionServer whose operating state is normal according to a preset scheduling rule.
In this embodiment, when a RegionServer's operating state is abnormal, a scheduling relationship is established between the Region and a healthy RegionServer, so that data queries proceed smoothly and query efficiency is preserved.
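The rescheduling described above can be sketched with plain dictionaries. This does not use the HBase API; the function name, the data shapes and the "preset scheduling rule" (pick the healthy RegionServer that currently holds the fewest Regions) are all assumptions made for illustration.

```python
def reassign_regions(assignments, status):
    """Move Regions off RegionServers that are down.

    assignments: dict mapping Region name -> RegionServer name.
    status: dict mapping RegionServer name -> "up" or "down".
    Assumed rule: choose the healthy server holding the fewest Regions,
    breaking ties by server name.
    """
    healthy = [s for s, state in status.items() if state == "up"]
    if not healthy:
        raise RuntimeError("no healthy RegionServer available")
    load = {s: 0 for s in healthy}
    for server in assignments.values():
        if server in load:
            load[server] += 1
    result = dict(assignments)
    for region, server in sorted(assignments.items()):
        if status.get(server) != "up":
            target = min(healthy, key=lambda s: (load[s], s))
            result[region] = target  # release old relation, establish new one
            load[target] += 1
    return result


if __name__ == "__main__":
    current = {"r1": "rs1", "r2": "rs2", "r3": "rs2"}
    states = {"rs1": "up", "rs2": "down", "rs3": "up"}
    print(reassign_regions(current, states))
```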
Further, after the step S702, the method further includes:
judging whether the load degree of each RegionServer exceeds a preset load degree or not;
if the load degree of one of the RegionServers exceeds the preset load degree, stopping the running tasks related to that RegionServer, marking that RegionServer as a shunting point, and distributing the overloaded running tasks of the shunting point to other RegionServers that do not exceed the preset load degree.
Specifically, if the load degree of one of the RegionServers exceeds the preset load degree while that RegionServer is running, the running tasks related to it can be stopped, and the tasks in excess of the preset load degree are distributed to other RegionServers that do not exceed the preset load degree. The tasks can be distributed to at least one RegionServer that has a scheduling relation with the same Region; when a RegionServer exists that has no scheduling relation with that Region, the tasks can also be distributed to it, after which a scheduling relation between the Region and that RegionServer is established.
In this embodiment, when one of the RegionServers exceeds the preset load degree, its running tasks are shunted to idle RegionServers, which keeps all RegionServers running normally.
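The shunting behaviour can be sketched as follows. The sketch assumes that the load degree of a RegionServer is simply its number of running tasks and that only the tasks above the preset load degree are moved; both are simplifying assumptions, and `rebalance` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def rebalance(tasks: Dict[str, List[str]],
              limit: int) -> Tuple[Dict[str, List[str]], List[str]]:
    """Move excess tasks away from overloaded RegionServers.

    tasks: RegionServer name -> list of running task ids.
    limit: preset load degree, modelled here as a maximum task count.
    Returns the new task mapping and the list of shunting points.
    """
    result = {server: list(ts) for server, ts in tasks.items()}
    shunting_points = sorted(s for s, ts in result.items() if len(ts) > limit)

    for point in shunting_points:
        while len(result[point]) > limit:
            # Candidate targets are servers still below the preset load degree.
            targets = [s for s in result
                       if s != point and len(result[s]) < limit]
            if not targets:
                break  # nowhere to move the task; leave it in place
            target = min(targets, key=lambda s: (len(result[s]), s))
            result[target].append(result[point].pop())
    return result, shunting_points
```

Choosing the least-loaded target on every move keeps the receiving RegionServers from becoming shunting points themselves.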
Further, after the step S702, the method further includes:
acquiring the access count of each target Region, setting a target Region as an access hotspot when its access count reaches a preset count threshold, and performing transmission path optimization on the communication connection with the RegionServer corresponding to that target Region while the connection is being made.
It can be understood that the transmission path optimization includes, but is not limited to, increasing the throughput of the RegionServer, increasing the bandwidth of the communication connection, and the like.
In this embodiment, when the access count of a Region reaches the preset count threshold, the Region is determined to be an access hotspot, and while the server communicates with the corresponding RegionServer, the RegionServer optimizes the transmission path of the connection (for example, by releasing RegionServer running memory that is unrelated to the current transmission), which increases the transmission rate and saves transmission time.
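The hotspot decision can be sketched with a simple counter. The class name `HotspotTracker` and its methods are hypothetical; the sketch covers only the threshold check and leaves the transmission path optimization itself to the caller.

```python
from collections import Counter


class HotspotTracker:
    """Counts accesses per Region and flags Regions that reach a threshold."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold      # preset count threshold
        self.counts: Counter = Counter()
        self.hotspots: set = set()

    def record_access(self, region: str) -> bool:
        """Record one access; return True if the Region is now a hotspot."""
        self.counts[region] += 1
        if self.counts[region] >= self.threshold:
            self.hotspots.add(region)
        return region in self.hotspots
```

A caller would trigger whatever optimization it supports the first time `record_access` returns True for a Region.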
In summary, in the data homogenization method provided above, the number of Regions is bounded by the preset data threshold and the total data amount of each primary key content in the data table. This prevents Regions from undergoing frequent, unrestricted split operations that would occupy a large share of the HBase cluster's resources and degrade its overall read-write performance, and therefore improves the stability of the HBase cluster. The primary key data corresponding to each first key value in the stored data of a primary key content is distributed to the Regions for storage, so that the primary key data of each primary key content in the data table is spread roughly evenly across the Regions; this reduces the operating load on the RegionServers corresponding to those Regions and improves their running speed and read-write performance.
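As an illustration of how the preset data threshold bounds the number of Regions, the following sketch assumes that the count for one primary key content is its total data amount divided by the threshold, rounded up, with a minimum of one Region. The rounding rule and the name `region_count` are assumptions made for this example.

```python
def region_count(total_bytes: int, threshold_bytes: int) -> int:
    """Number of Regions needed so that none exceeds the preset threshold."""
    if threshold_bytes <= 0:
        raise ValueError("threshold must be positive")
    if total_bytes < 0:
        raise ValueError("total data amount must not be negative")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return max(1, -(-total_bytes // threshold_bytes))
```

For example, under these assumptions 25 GiB of data with a 10 GiB threshold yields three Regions.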
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In one embodiment, a data uniformization apparatus is provided, and the data uniformization apparatus corresponds to the data uniformization method in the foregoing embodiment one to one. As shown in fig. 5, the data uniformizing apparatus includes a setting module 11, a first obtaining module 12, a first determining module 13, a second obtaining module 14, a third obtaining module 15, a second determining module 16, and a distributing module 17. The functional modules are explained in detail as follows:
a setting module 11, configured to set a maximum data storage amount of a Region in the HBase cluster as a preset data threshold;
a first obtaining module 12, configured to receive a data homogenization instruction for a data table in the HBase cluster, and obtain all primary key contents of the data table; wherein one of the primary key contents is associated with a total data amount of one of the types of stored data of the data table; the storage data corresponding to the primary key content comprises at least one primary key data;
a first determining module 13, configured to determine, according to the preset data threshold and the total data amount corresponding to each primary key content, the number of regions corresponding to each primary key content in the data table;
a second obtaining module 14, configured to obtain, by using a preset Hash algorithm, a RowKey dataset corresponding to each primary key content in the data table; the RowKey data set comprises RowKey of each piece of primary key data corresponding to the primary key content;
a third obtaining module 15, configured to, after extracting a first key value from each RowKey in the RowKey data set corresponding to the primary key content, sequence each primary key data corresponding to the primary key content according to the first key value, and obtain a sequence value after the sequencing;
a second determining module 16, configured to determine, according to the number of regions corresponding to the primary key content and the sequence value, two boundary elements of each Region corresponding to the primary key content from all primary key data corresponding to the primary key content, and determine a front boundary value and a rear boundary value of each Region corresponding to the primary key content through the two boundary elements of the Region; the front boundary value and the rear boundary value are first key values corresponding to the boundary elements; recording the range between the front boundary value and the rear boundary value of each Region as the boundary range of the Region;
and the allocating module 17 is configured to acquire the first key value of the primary key data corresponding to the primary key content, and allocate the primary key data corresponding to the first key value in the primary key content to a Region for storage when it is determined that the size of the acquired first key value is within the boundary range of the Region corresponding to the primary key content.
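The work of the second determining module and the allocating module can be sketched as below. The sketch assumes that first key values are distinct integers, that the sequence value is the position in the sorted list, and that each Region receives a contiguous, near-equal slice of that list; the function names are hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple


def region_boundaries(first_keys: List[int],
                      n_regions: int) -> List[Tuple[int, int]]:
    """Return (front boundary value, rear boundary value) for each Region."""
    ordered = sorted(first_keys)   # index in this list = sequence value
    n = len(ordered)
    if n_regions < 1 or n_regions > n:
        raise ValueError("n_regions must be between 1 and len(first_keys)")
    bounds = []
    for i in range(n_regions):
        start = i * n // n_regions            # first boundary element
        end = (i + 1) * n // n_regions - 1    # second boundary element
        bounds.append((ordered[start], ordered[end]))
    return bounds


def assign_region(first_key: int, bounds: List[Tuple[int, int]]) -> int:
    """Index of the Region whose boundary range contains first_key."""
    fronts = [front for front, _ in bounds]
    # Keys below the first front boundary fall back to Region 0.
    return max(bisect_right(fronts, first_key) - 1, 0)
```

Because the slices are cut by position rather than by value, each Region receives roughly the same number of primary key data items even when the first key values are unevenly spaced.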
Further, the second obtaining module includes:
the extraction submodule is used for obtaining the HashKey of each primary key data in each primary key content by applying Hash to each primary key data in the type of storage data corresponding to each primary key content in the data table, and extracting each second key value corresponding to each HashKey;
the operation submodule is used for acquiring the character length of each primary key data in each primary key content, and respectively operating each second key value and each character length by using an absolute value algorithm to obtain each third key value;
the splicing submodule is used for acquiring characters corresponding to the third key values from a preset character distribution table, and correspondingly splicing the acquired characters and corresponding primary key data through preset symbols to obtain the RowKey data set; each RowKey in the RowKey data set is formed by splicing at least one character and one primary key data in one primary key content through a preset symbol.
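The three submodules can be illustrated with a short sketch. The text does not specify the Hash function, the exact absolute value operation, the character distribution table, or the preset symbol, so every concrete choice here is an assumption: SHA-256 as the Hash, the first four digest bytes as the second key value, `abs(second + length)` reduced modulo the table size as the third key value, a 16-character table, and `_` as the preset symbol.

```python
import hashlib

CHAR_TABLE = "0123456789abcdef"   # hypothetical preset character distribution table
SEPARATOR = "_"                   # hypothetical preset symbol


def make_rowkey(primary_key: str) -> str:
    """Build a RowKey of the form <character><symbol><primary key data>."""
    # HashKey of the primary key data (SHA-256 is an assumed choice).
    digest = hashlib.sha256(primary_key.encode("utf-8")).digest()
    # Second key value: a signed integer extracted from the HashKey.
    second = int.from_bytes(digest[:4], "big", signed=True)
    # Third key value: absolute value of the second key value combined
    # with the character length, reduced to an index into the table.
    third = abs(second + len(primary_key)) % len(CHAR_TABLE)
    return f"{CHAR_TABLE[third]}{SEPARATOR}{primary_key}"
```

The prefix character spreads otherwise sequential primary key data across the key space, while the original primary key data stays readable after the preset symbol.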
Further, the data uniformization apparatus further includes:
a third determining module, configured to obtain a RowKey to be retrieved in the RowKey data set corresponding to the primary key content, and determine, according to the first key value corresponding to the RowKey, a target Region where the primary key data corresponding to the RowKey is stored in the HBase cluster;
and the writing module is used for positioning at least one Region server which has a scheduling relation with the determined target Region, reading the primary key data corresponding to the RowKey through the Region server after establishing communication connection with the Region server, and writing the primary key data read by the Region server into a local preset data table through communication connection.
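The retrieval path handled by these two modules can be sketched with in-memory stand-ins for the cluster. The sketch assumes that the first key value is the RowKey prefix before a `_` symbol, that Regions are identified by their front boundary values, and that plain dictionaries play the role of the RegionServers and the local preset data table; none of these names correspond to the HBase client API.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple


def first_key_value(rowkey: str, separator: str = "_") -> str:
    """Assumed extraction rule: the prefix before the preset symbol."""
    return rowkey.split(separator, 1)[0]


def fetch_to_local(rowkey: str,
                   region_fronts: List[str],
                   region_servers: Dict[int, List[str]],
                   store: Dict[str, Dict[str, str]],
                   local_table: Dict[str, str]) -> Tuple[int, str]:
    """Locate the target Region, read via its RegionServer, write locally.

    region_fronts:  sorted front boundary values, one per Region.
    region_servers: Region index -> RegionServers scheduled for it.
    store:          RegionServer name -> {RowKey: primary key data}.
    """
    key = first_key_value(rowkey)
    region = max(bisect_right(region_fronts, key) - 1, 0)
    server = region_servers[region][0]        # any scheduled server will do
    local_table[rowkey] = store[server][rowkey]
    return region, server
```

Because the target Region is derived from the first key value alone, the lookup does not need to scan the other Regions.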
Further, the write module further includes:
the third obtaining module is used for monitoring, in real time, the operating state of each RegionServer participating in the read-write task, and acquiring the target Region associated with a RegionServer when the operating state of that RegionServer is down;
and the establishing module is used for causing the target Region to release its scheduling relation with the down RegionServer, and establishing, according to a preset scheduling rule, a scheduling relation between the target Region and at least one RegionServer whose operating state is normal.
Further, the write module further includes:
the judging module is used for judging whether the load degree of each RegionServer exceeds a preset load degree;
and the distribution module is used for, if the load degree of one of the RegionServers exceeds the preset load degree, stopping the running tasks related to that RegionServer, marking that RegionServer as a shunting point, and distributing the overloaded running tasks of the shunting point to other RegionServers that do not exceed the preset load degree.
Further, the write module further includes:
and the optimization processing module is used for acquiring the access count of each target Region, setting a target Region as an access hotspot when its access count reaches a preset count threshold, and performing transmission path optimization on the communication connection with the RegionServer corresponding to that target Region while the connection is being made.
For the specific definition of the data uniformization device, reference may be made to the above definition of the data uniformization method, which is not described herein again. The various modules in the data homogenizing apparatus described above may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 6. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data involved in the data homogenizing method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a data homogenizing method.
In one embodiment, a computer device is provided, which includes a memory, a processor and a computer program stored in the memory and executable on the processor, and the processor executes the computer program to implement the steps of the data uniformization method in the above embodiments, such as the steps S10 to S70 shown in fig. 2. Alternatively, the processor, when executing the computer program, implements the functions of the respective modules/units of the data uniformization apparatus in the above-described embodiments, such as the functions of the modules 11 to 17 shown in fig. 5. To avoid repetition, further description is omitted here.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the steps of the data uniformization method in the above embodiments, such as the steps S10 to S70 shown in fig. 2. Alternatively, the computer program, when executed by the processor, implements the functions of the respective modules/units of the data uniformization apparatus in the above-described embodiments, such as the functions of the modules 11 to 17 shown in fig. 5. To avoid repetition, further description is omitted here.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, databases, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A method of data homogenization, comprising:
setting the maximum data storage capacity of a Region in the HBase cluster as a preset data threshold;
receiving a data homogenization instruction for a data table in the HBase cluster, and acquiring all primary key contents of the data table; wherein one of the primary key contents is associated with a total data amount of one of the types of stored data of the data table; the storage data corresponding to the primary key content comprises at least one primary key data;
determining the number of Regions corresponding to each primary key content in the data table according to the preset data threshold and the total data amount corresponding to each primary key content;
acquiring a RowKey data set corresponding to each primary key content in the data table by using a preset Hash algorithm; the RowKey data set comprises RowKey of each piece of primary key data corresponding to the primary key content;
after extracting a first key value from each RowKey in the RowKey data set corresponding to the primary key content, sequencing each primary key data corresponding to the primary key content according to the first key value, and acquiring a sequence value after the sequencing;
determining two boundary elements of each Region corresponding to the primary key content from all primary key data corresponding to the primary key content according to the number of the regions corresponding to the primary key content and the sequence value, and determining a front boundary value and a rear boundary value of each Region corresponding to the primary key content through the two boundary elements of the Region; the front boundary value and the rear boundary value are first key values corresponding to the boundary elements; recording the range between the front boundary value and the rear boundary value of each Region as the boundary range of the Region;
and acquiring the first key value of the primary key data corresponding to the primary key content, and distributing the primary key data corresponding to the first key value in the primary key content to a Region for storage when the size of the acquired first key value is confirmed to be within the boundary range of the Region corresponding to the primary key content.
2. The data uniformization method according to claim 1, wherein the obtaining of the RowKey data set corresponding to each of the primary key contents in the data table by using the preset Hash algorithm comprises:
obtaining a HashKey of each primary key data in each primary key content by applying Hash to each primary key data in the type of storage data corresponding to each primary key content in the data table, and extracting each second key value corresponding to each HashKey;
acquiring the character length of each primary key data in each primary key content, and respectively calculating each second key value and each character length by using an absolute value algorithm to obtain each third key value;
acquiring characters corresponding to the third key values from a preset character distribution table, and correspondingly splicing the acquired characters and corresponding primary key data through preset symbols to obtain the RowKey data set; each RowKey in the RowKey data set is formed by splicing at least one character and one primary key data in one primary key content through a preset symbol.
3. The data uniformization method according to claim 1, wherein after assigning the primary key data corresponding to the first key value in the primary key content to the Region for storage, the method further comprises:
acquiring a RowKey to be retrieved in the RowKey data set corresponding to the primary key content, and determining a target Region of the primary key data corresponding to the RowKey stored in the HBase cluster according to the first key value corresponding to the RowKey;
and positioning at least one RegionServer having a scheduling relation with the determined target Region, after establishing communication connection with the RegionServer, reading the primary key data corresponding to the RowKey through the RegionServer, and writing the primary key data read by the RegionServer into a local preset data table through communication connection.
4. The data uniformization method according to claim 3, wherein after locating at least one RegionServer having a scheduling relation with the determined target Region, the method further comprises:
monitoring, in real time, the operating state of each RegionServer participating in the read-write task, and, when the operating state of one RegionServer is down, acquiring the target Region associated with that RegionServer;
and causing the target Region to release its scheduling relation with the down RegionServer, and establishing, according to a preset scheduling rule, a scheduling relation between the target Region and at least one RegionServer whose operating state is normal.
5. The data uniformization method according to claim 3, wherein after writing the primary key data read by the RegionServer into the local preset data table through the communication connection, the method further comprises:
judging whether the load degree of each RegionServer exceeds a preset load degree;
if the load degree of one of the RegionServers exceeds the preset load degree, stopping the running tasks related to that RegionServer, marking that RegionServer as a shunting point, and distributing the overloaded running tasks of the shunting point to other RegionServers that do not exceed the preset load degree.
6. The data uniformization method according to claim 3, wherein after writing the primary key data read by the RegionServer into the local preset data table through the communication connection, the method further comprises:
acquiring the access count of each target Region, setting a target Region as an access hotspot when its access count reaches a preset count threshold, and performing transmission path optimization on the communication connection with the RegionServer corresponding to that target Region while the connection is being made.
7. A data uniformizing apparatus, comprising:
the setting module is used for setting the maximum data storage capacity of a Region in the HBase cluster as a preset data threshold;
the first acquisition module is used for receiving a data homogenization instruction of a data table in the HBase cluster and acquiring all primary key contents of the data table; wherein one of the primary key contents is associated with a total data amount of one of the types of stored data of the data table; the storage data corresponding to the primary key content comprises at least one primary key data;
a first determining module, configured to determine, according to the preset data threshold and the total data amount corresponding to each primary key content, the number of regions corresponding to each primary key content in the data table;
the second acquisition module is used for acquiring a RowKey data set corresponding to each primary key content in the data table by using a preset Hash algorithm; the RowKey data set comprises RowKey of each piece of primary key data corresponding to the primary key content;
a third obtaining module, configured to, after extracting a first key value from each RowKey in the RowKey data set corresponding to the primary key content, sequence each primary key data corresponding to the primary key content according to the first key value, and obtain a sequence value after the sequencing;
a second determining module, configured to determine, according to the number of regions corresponding to the primary key content and the sequence value, two boundary elements of each Region corresponding to the primary key content from all primary key data corresponding to the primary key content, and determine a front boundary value and a rear boundary value of each Region corresponding to the primary key content through the two boundary elements of the Region; the front boundary value and the rear boundary value are first key values corresponding to the boundary elements; recording the range between the front boundary value and the rear boundary value of each Region as the boundary range of the Region;
and the distribution module is used for acquiring the first key value of the primary key data corresponding to the primary key content, and distributing the primary key data corresponding to the first key value in the primary key content to a Region for storage when the size of the acquired first key value is confirmed to be within the boundary range of the Region corresponding to the primary key content.
8. The data uniformizing apparatus according to claim 7, wherein the second obtaining module comprises:
the extraction submodule is used for obtaining the HashKey of each primary key data in each primary key content by applying Hash to each primary key data in the type of storage data corresponding to each primary key content in the data table, and extracting each second key value corresponding to each HashKey;
the operation submodule is used for acquiring the character length of each primary key data in each primary key content, and respectively operating each second key value and each character length by using an absolute value algorithm to obtain each third key value;
the splicing submodule is used for acquiring characters corresponding to the third key values from a preset character distribution table, and correspondingly splicing the acquired characters and corresponding primary key data through preset symbols to obtain the RowKey data set; each RowKey in the RowKey data set is formed by splicing at least one character and one primary key data in one primary key content through a preset symbol.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the data homogenizing method according to any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a data homogenization method according to any one of claims 1 to 6.
CN202010066910.5A 2020-01-20 2020-01-20 Data homogenizing method, device, computer equipment and storage medium Active CN111259012B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010066910.5A CN111259012B (en) 2020-01-20 2020-01-20 Data homogenizing method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111259012A true CN111259012A (en) 2020-06-09
CN111259012B CN111259012B (en) 2024-03-12

Family

ID=70952479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010066910.5A Active CN111259012B (en) 2020-01-20 2020-01-20 Data homogenizing method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111259012B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant