CN115934819A - Universal distributed expansion method for industrial time sequence database - Google Patents

Universal distributed expansion method for industrial time sequence database

Publication number
CN115934819A
Authority
CN
China
Prior art keywords
data
database
sub
library
distribution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211591049.XA
Other languages
Chinese (zh)
Inventor
周淳
王想
史英杰
周时颉
王鑫晨
史金伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CHINA REALTIME DATABASE CO LTD
Original Assignee
CHINA REALTIME DATABASE CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CHINA REALTIME DATABASE CO LTD filed Critical CHINA REALTIME DATABASE CO LTD
Priority to CN202211591049.XA priority Critical patent/CN115934819A/en
Publication of CN115934819A publication Critical patent/CN115934819A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a general distributed expansion method for an industrial time sequence database. The method constructs a system architecture comprising a database system access interface module, a database scheduling and managing module and a data calculation and storage module; determines a data distribution rule that uses measuring points or time as the distribution dimension, so that the concurrency of task processing is maximized and the complexity of data transmission and summarization is reduced; performs sub-library management; transfers the task distribution and result summarizing logic from the scheduling management node to the access interface modules, where it is spread evenly; ensures system reliability through data copies, selecting the primary-standby mirror mode if the original time sequence database already has a mirror function; and unifies the database system access interface module. The invention solves the problems of data distribution, data redundancy and data synchronization, and it can completely reuse the storage and calculation engines of the existing time sequence database system, so the modification workload is small and technical support is provided for expanding current industrial time sequence databases.

Description

Universal distributed expansion method for industrial time sequence database
Technical Field
The invention relates to a distributed expansion method of a real-time database, in particular to a general distributed expansion method of an industrial time sequence database.
Background
Time series data in the industrial field are generally collected or generated by various real-time monitoring and analysis equipment in industries such as electric power and the chemical industry. The typical characteristics of such data and of its use are: strong time ordering (each record carries a timestamp identifying its acquisition time, and one acquisition point produces exactly one record at a given moment), high acquisition frequency (each acquisition device or acquisition point can generate several records within one second), high data throughput, high real-time requirements, and queries that are generally based on time.
Early industrial time sequence databases were generally deployed in a single-machine, centralized manner. With upgrades to hardware such as acquisition equipment and networks, acquisition equipment is used more widely and acquisition frequency has risen, so the data volume generated by industrial real-time systems grows exponentially, and business applications place higher requirements on the management of mass data and on real-time performance.
Designing a distributed expansion architecture by directly introducing mature distributed computing and storage technology usually requires heavily reworking, or even completely rewriting, the computing logic and core storage of the original single-machine database. By studying the key technologies of distributed time sequence databases (data distribution, data flow, database management, data redundancy, synchronization and the like) and combining them with the characteristics of industrial time sequence data, a lightweight, general distributed expansion method for industrial time sequence databases is designed. It can completely reuse the storage and computing engine of the existing time sequence database system, break through the scale and performance bottleneck of the existing single-machine database, provide higher data reliability, and provide technical support for the application development of industrial time sequence databases.
Disclosure of Invention
Purpose of the invention: the technical problem to be solved is to provide a universal distributed expansion method for an industrial time sequence database. Aimed at the characteristics of industrial real-time mass data, the method solves the limited capacity and weak processing capability of the traditional single-machine database, avoids the excessive modification workload that arises when a distributed system architecture is introduced directly into a traditional database, and meets the requirements of large data volume and high real-time performance in industrial real-time mass data processing.
The technical scheme is as follows: the invention provides a universal distributed expansion method for an industrial time sequence database, which comprises the following parts:
(1) Constructing a distributed real-time database system comprising a database system access interface module, a database scheduling and managing module and a data calculation and storage module;
(2) Determining a data distribution rule: measuring points or time are used as distribution dimensions, so that the concurrency efficiency of task processing is improved to the maximum extent, and the complexity of data transmission and summarization is reduced;
(3) Sub-library management: the database scheduling and managing module handles sub-library information management, monitoring of sub-library state, load balancing and automatic scheduling of sub-library resources, and takes into account the sub-library data migration caused by physical node changes;
(4) Data flow: the task distribution and result summarizing logic is transferred from the scheduling management node to the access interface modules, where it is spread evenly;
(5) Data redundancy and synchronization: the reliability of the system is ensured through data copies, and if the original time sequence database has a mirror function, the primary-standby mirror mode is selected;
(6) Unified access interface: the unified database system access interface module provides transparent read-write access to the whole time sequence database system for various applications.
Furthermore, the database system access interface module provides transparent read-write access to the whole database system for various applications; the database scheduling and managing module defines a rule of data distribution, is distributed in one or more nodes, is a manager of the whole time sequence database system, monitors and manages the state and task scheduling of all database sub-databases in real time, and provides the meta information to the database access interface module; the data calculation and storage module consists of a plurality of database sub-databases, each database sub-database provides independent calculation and storage services, and each sub-database receives or executes decomposed data or query tasks.
Further, the step (2) comprises the steps of:
(21) Extracting a mapping relation between a measuring point and equipment according to a data model of an original single-machine database;
(22) Extracting attribute characteristic values of the measuring points: a coding rule f is defined such that, for a measuring point A, some of its attributes are converted through f, and f(A) is the attribute characteristic value of measuring point A. The characteristic value f(A) must be invariant and unique. Invariance means that f(A) does not change as the data collected by measuring point A change; uniqueness means that ∀A and ∀B, if A ≠ B, then f(A) ≠ f(B);
(23) Determining relevant distribution dimensions including two dimensions of a measuring point attribute characteristic value and time according to the main data;
(24) Designing a distribution rule, namely defining a distribution function k = p (x, y, z), wherein x is a characteristic value f (A) of the attribute of the measuring point, y is a time point corresponding to data, z is the number of currently set data fragments, and k is a result of the data of a certain measuring point at a certain moment after the data is subjected to distribution function operation and represents the serial number of the data fragments to be classified.
Further, the step (3) is realized as follows:
the single sub-library is only positioned at a certain physical node in the cluster, one physical node comprises a plurality of sub-libraries, and the single sub-library only processes data of one data fragment, is a relatively independent working unit and has independent running threads, cache spaces and persistent storage paths;
the number of sub-libraries P_num is larger than the number of physical data nodes Node_Num; the maximum number of sub-libraries Num_perNode that a single physical node can support is calculated from the resources the original single library occupies at run time, and the maximum value of P_num is the product of Node_Num and Num_perNode;
as the total data amount increases and the physical data nodes are expanded, P_num is kept unchanged and the granularity of data migration is set to the whole data fragment.
Further, the step (5) is realized as follows:
data redundancy adopts one of two modes, mirror copy or multi-active. In the mirror mode, i.e. the primary-standby mode, there is one primary library and one or more standby libraries: each database sub-library comprises a primary library and mirror libraries that form a group or a chain; a write request for a sub-library is first sent to its primary library, which receives the data and automatically synchronizes them to the corresponding mirror libraries. The primary library and the mirror libraries register with the scheduling management node at start-up and update their state in real time. When a database application initializes its connection, the scheduling management node allocates one library from each group: the primary library of the group for a write connection, and the library with the least pressure in the group for a read connection;
in the multi-active mode, each database sub-library is split into several peer libraries that form a group, and all libraries in the group mirror one another. A write request for a sub-library can be sent to any library in the group; the receiving library automatically synchronizes the data to its peers. All peer libraries register with the scheduling management node at start-up and update their state in real time, and when a database application initializes its connection, the scheduling management node selects the library with the least pressure from each group.
Further, the step (6) is realized as follows:
when the access interface of the unified database system establishes connection every time, obtaining information of sub-libraries to be connected and data distribution rules in all latest sub-library groups from the scheduling management service and generating a sub-library information table, and simultaneously connecting all sub-library groups; when data reading and writing tasks are executed each time, the measured point data and the value data are logically split according to a sub-base group through a data distribution rule, the bottom layer calls an original database interface to process data, and finally all results are organized and summarized and returned to the application.
Advantageous effects: compared with the prior art, the invention breaks through the scale bottleneck on the number of measuring points (tag points) in a centralized time sequence database; breaks through the bottleneck of the overall read-write performance of the centralized time sequence database; reuses the original centralized time sequence database to a high degree with little modification; and improves the data reliability and the stability of the whole database system.
Drawings
FIG. 1 is a schematic diagram of a distributed real-time database system deployment architecture;
FIG. 2 illustrates the movement of data fragments during capacity expansion of a data node;
FIG. 3 is a write data flow for task scheduling by the access interface module;
FIG. 4 is a read data flow for task scheduling by the access interface module;
FIG. 5 is a data write for data redundancy in a mirrored manner;
FIG. 6 is a data write for dual active mode data redundancy;
FIG. 7 is the location of the unified access interface module in the system.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
The invention provides a general distributed expansion method for an industrial time sequence database which, as shown in FIG. 1, comprises the following steps:
Step 1: constructing the deployment architecture of the distributed real-time database system. The whole system is divided into three parts: a database system access interface module (embedded in the application in the form of an API), a database scheduling and managing module (the scheduling management cluster) and a data calculation and storage module (the database cluster).
The database system access interface module provides transparent read-write access to the whole database system for various applications, the transparent access means that the applications are connected with the database cluster through the access interface, the size of the database cluster and the increase and decrease of database nodes do not need to be concerned, and the operation of the application on the whole database cluster is similar to that of a single database.
The data calculation and storage module consists of a plurality of time sequence database instances (database sub-libraries). Each database sub-library provides independent calculation and storage services and is equivalent to the original single-machine database, which is reused essentially unchanged; each sub-library receives and executes the decomposed data or query tasks. All sub-libraries are deployed on a plurality of database nodes (physical nodes) to form a database cluster; the number of data nodes is limited only by hard constraints such as Ethernet bandwidth and the physical conditions of the machine room. Each database sub-library stores only the data belonging to its corresponding partition, and the sub-libraries are logically equivalent.
The database scheduling and managing module is distributed in one or more nodes and is a manager of the whole time sequence database system, the module needs to define rules of data distribution, in addition, the real-time monitoring and management are carried out on the state and task scheduling of all database sub-databases, the meta information is provided for the database access interface module, and the whole scheduling and managing module is composed of a plurality of completely equivalent services.
Step 2: establishing the distribution rule.
An important data source for industrial time sequence databases is equipment acquisition. Because devices are relatively independent, data generated by different devices have little correlation; correlation queries across tables (measuring points) are rarely needed, and the more common operation is aggregating the historical data generated by the same area or the same device. Using measuring points or time as the distribution dimension for sub-library expansion therefore maximizes the concurrency of task processing and reduces the complexity of data transmission and summarization. A simple data distribution method is to assign the measuring points of different regions to different fragments, but this has a narrow range of application and easily causes data hot spots. The design of the distribution function has two goals: 1) distribute the data as evenly as possible; 2) reduce the complexity of data aggregation queries. Even distribution balances the load of data writing and reading and lowers the probability of cluster hot spots. The complexity of an aggregation query depends mainly on whether it can simply be split into sub-queries that are distributed to the sub-libraries and whose results are summarized directly after execution.
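The second goal can be made concrete with a small sketch. It assumes one measuring point whose history has been split by time across two hypothetical sub-libraries, and it uses population variance as a stand-in for the kind of aggregate that cannot simply be split; the readings are invented for illustration.

```python
from statistics import pvariance

# Hypothetical readings of ONE measuring point (values are illustrative only).
readings = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

# Time-based distribution: the point's history is split across two sub-libraries.
shard_early, shard_late = readings[:3], readings[3:]

# Naive summarization: average the variances each sub-library computed locally.
naive = (pvariance(shard_early) + pvariance(shard_late)) / 2

# Point-based distribution: one sub-library holds every reading and answers directly.
exact = pvariance(readings)

print(f"naive combination: {naive:.4f}, exact result: {exact:.4f}")
```

The naive combination ignores the distance between the two local means, so it underestimates the spread. A time-split aggregate of this kind can still be computed if each sub-library returns partial sums instead of a finished result, but that means changing the query logic of the original database.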
For example, if data time is the distribution dimension, the data of one measuring point are spread over several sub-libraries by time. If an aggregate query such as the mean square error must then be run over all data of that measuring point, it is difficult to split the query, submit the parts to the sub-libraries and simply combine the results. If the characteristic value of the measuring point is the distribution dimension, all data of a measuring point reside in exactly one sub-library, so the aggregate query is completed by dispatching it to that single sub-library, which reduces the extra data transmission cost. The data distribution rule is determined as follows:
1) Extracting the mapping relation between the measuring points (tables) and the equipment (acquisition indexes) according to the data model of the original single-machine database. For example, one measuring point corresponds to the transformer of a charging pile at a certain location in a certain area.
2) Extracting attribute characteristic values of the measuring points: a coding rule f is defined such that, for a measuring point A, some of its attributes are converted through f, and f(A) is the attribute characteristic value of measuring point A. The characteristic value f(A) must be invariant and unique. Invariance means that f(A) does not change as the data collected by measuring point A change; uniqueness means that ∀A and ∀B, if A ≠ B, then f(A) ≠ f(B). There are many ways to design the coding rule, for example converting all invariant attributes of a measuring point into characters and concatenating them, provided uniqueness is ensured.
3) Determining the relevant distribution dimensions according to the main data; they generally comprise two dimensions, the measuring point attribute characteristic value and time.
4) Designing a distribution rule, namely defining a distribution function k = p (x, y, z), wherein x is a characteristic value f (A) of the attribute of the measuring point, y is a time point corresponding to data, z is the number of currently set data fragments, and k is a result of the data of a certain measuring point at a certain moment after the data is subjected to distribution function operation and represents the serial number of the data fragments to be classified.
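The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the attribute names (`region`, `device`, `metric`), the `|` separator, the CRC32 hash and the one-month time bucket are all choices made for the example.

```python
import zlib
from datetime import datetime

def feature_value(region: str, device: str, metric: str) -> str:
    """Coding rule f: concatenate invariant attributes of a measuring point.

    The attribute names are assumptions for illustration. The separator must
    not occur inside any attribute, otherwise uniqueness is not guaranteed.
    """
    return "|".join((region, device, metric))

def distribute(x: str, y: datetime, z: int, by_time: bool = False) -> int:
    """Distribution function k = p(x, y, z); returns a fragment number in [0, z)."""
    if by_time:
        # Time as the distribution dimension: one bucket per calendar month (assumed size).
        return (y.year * 12 + y.month - 1) % z
    # Measuring point as the distribution dimension: stable hash of f(A).
    return zlib.crc32(x.encode("utf-8")) % z

fv = feature_value("east", "transformer-07", "voltage")
k = distribute(fv, datetime(2022, 12, 1), 16)
```

With the measuring point as the dimension, `k` does not depend on `y`, so every record of a point lands in the same fragment. `zlib.crc32` is used instead of Python's built-in `hash()` because the built-in string hash changes between interpreter runs.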
And step 3: and (3) sub-warehouse management: the database scheduling and managing module belongs to a module which needs to be newly added in distributed expansion, and is used for mainly processing database information management, monitoring of database state, load balancing and automatic scheduling of database resources, and database data migration caused by physical node change needs to be considered.
A single sub-database is only located at a certain physical node in the cluster, one physical node can comprise a plurality of sub-databases, the single sub-database only processes data of one data fragment, is a relatively independent working unit, and has independent running threads, cache space and persistent storage paths, and is similar to the original single-machine database.
When the time sequence database cluster is installed, a sub-library information configuration file must be generated, automatically or manually, according to the hardware resources of the whole cluster. The configuration file contains the identifiers of all sub-libraries (such as sub-library instance names, which can be generated automatically) and information about the physical nodes where the sub-libraries are located (such as the physical node IP or host name). Every sub-library must register with the scheduling management module at start-up and report its load information in real time while running. The scheduling management module performs two-way verification between the configuration and the registration information to confirm that a sub-library is legitimate. It also takes the number of sub-libraries as one parameter of the distribution rule, so that each sub-library group corresponds to a determined range of data (a data fragment) according to the data distribution rule of Step 2. The meta-information managed by the scheduling management module mainly comprises the data distribution rules and the sub-library information.
Regarding the principle for setting the number of sub-libraries P_num: first, the number of sub-libraries must be greater than the number of physical data nodes Node_Num; second, the maximum number of sub-libraries Num_perNode that a single physical node can support is calculated from the resources occupied by the original single library at run time, so the maximum value of P_num is the product of Node_Num and Num_perNode.
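As a rough sketch of this sizing rule, the bound can be computed from per-node resources. The choice of memory and CPU cores as the limiting resources, and all parameter names, are assumptions for illustration; the text only says that Num_perNode is derived from the resource usage of the original single library.

```python
def max_sub_libraries(node_num: int, node_memory_gb: float, node_cores: int,
                      lib_memory_gb: float, lib_cores: int) -> int:
    """Upper bound of P_num, i.e. Node_Num * Num_perNode.

    Num_perNode is limited by whichever resource runs out first. The per-library
    figures are assumed to come from profiling the original single library.
    """
    num_per_node = min(int(node_memory_gb // lib_memory_gb), node_cores // lib_cores)
    return node_num * num_per_node

def check_p_num(p_num: int, node_num: int, upper: int) -> bool:
    # The rule in the text: Node_Num < P_num <= Node_Num * Num_perNode.
    return node_num < p_num <= upper

upper = max_sub_libraries(node_num=4, node_memory_gb=128, node_cores=32,
                          lib_memory_gb=16, lib_cores=4)
```

For four nodes that can each host eight sub-libraries, `upper` is 32, so any P_num from 5 to 32 satisfies the rule.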
As shown in fig. 2, as the amount of data increases, the capacity expansion of the physical data nodes may need to be considered. The change of the physical nodes inevitably needs to consider the rebalancing of the data of each node, and the general idea of data redistribution is to migrate the data of each data fragment with the distribution dimension (such as a measuring point) as the minimum granularity according to a new data distribution rule. This approach can consume a large amount of system resources, which may even result in the database system being unavailable for a long time when the total data size is large. If the original database partitioning number is set to be larger, because the data storage of each data partition (database partitioning) is independent and complete, and the size of the database partitioning number can be set to be unchanged by a new distribution rule at the moment, the granularity of data migration can be set to be the whole data partition (database partitioning), and the method can directly perform massive file migration without data decompression and hash operation, so that the expenses of CPU operation, disk IO and network IO are greatly reduced, and the migration efficiency can be improved by hundreds of times.
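A minimal sketch of whole-fragment migration during capacity expansion follows. It assumes a placement map from fragment number to node name and a simple "move from the fullest node to the emptiest node" policy; both are illustrative choices, and the sketch only plans the moves rather than copying files.

```python
def rebalance(placement: dict[int, str], nodes: list[str]) -> list[tuple[int, str, str]]:
    """Plan whole-fragment moves until node loads differ by at most one fragment.

    placement maps fragment number -> current node and is updated in place.
    nodes must contain every node already used in placement plus any new nodes.
    The number of fragments (P_num) never changes; only their location does.
    """
    load = {n: [] for n in nodes}
    for frag, node in sorted(placement.items()):
        load[node].append(frag)
    moves = []
    while True:
        busiest = max(nodes, key=lambda n: len(load[n]))
        idlest = min(nodes, key=lambda n: len(load[n]))
        if len(load[busiest]) - len(load[idlest]) <= 1:
            break
        frag = load[busiest].pop()
        load[idlest].append(frag)
        placement[frag] = idlest
        moves.append((frag, busiest, idlest))
    return moves

# Eight fragments on two nodes; a third node joins the cluster.
placement = {k: ("node1" if k < 4 else "node2") for k in range(8)}
moves = rebalance(placement, ["node1", "node2", "node3"])
```

Here `moves` is `[(3, 'node1', 'node3'), (7, 'node2', 'node3')]`: two fragments are relocated as complete units and the other six are untouched, which is why no record-level rehashing is needed.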
Step 4: data flow.
A query against the industrial time sequence database only needs to be split into several subtasks according to the data distribution rule and distributed to the relevant sub-libraries; the results are summarized directly after execution, and no data exchange among sub-libraries or multi-task scheduling is involved. Therefore, once an access interface module has obtained the metadata and the data distribution rules from the scheduling management node, it can take over the task distribution, result summarization and routing logic directly; that is, the data stream does not need to pass through the scheduling management node.
The task distribution and result collection processing logic is transferred and shared to each access interface module by the scheduling management node, compared with a general distributed system in which the scheduling management node processes the task distribution and scheduling collection, the whole data transmission overhead can be reduced by nearly half, and the access interface module only needs to acquire metadata (data distribution rules and each database state) from the scheduling management node and synchronize in real time. Therefore, the method has two advantages, one is that the overhead of data secondary transmission and processing is reduced and avoided by transferring the modules of the task distribution and result summarization processing logic, and the data can not become the performance bottleneck of the database system. Secondly, the dispatching management node is changed from heavy service to light service which does not relate to actual task processing, the whole dispatching management cluster only needs to ensure consistency of system metadata, and stability and reliability are higher.
As shown in fig. 3 and 4, after receiving a read-write request, an access interface module synchronizes data and metadata distribution rules and each sub-library state from a scheduling management node, screens related data sub-libraries according to measurement points and time ranges related to the request, splits and reorganizes a plurality of sub-requests to be distributed to a plurality of data sub-libraries through the distribution rules, returns results after parallel processing of the plurality of data sub-libraries is completed, performs aggregation processing after all the distributed sub-requests return the results, and feeds the results back to an application server or a client. Each sub-request can only be sent to a certain data sub-library main library or a certain data sub-library standby library, when abnormal, the sub-requests can be continuously sent to another mutual standby sub-library, meanwhile, the abnormal state of the sub-libraries needs to be reported to the scheduling management node, the access interface module selects the distribution priority by judging the busy degree of the main and standby sub-libraries, the data node regularly reports the current activity state and the busy degree to the scheduling management node, and the measure of the busy degree can be conveniently and comprehensively measured and calculated from the average utilization rate of a CPU (Central processing Unit), the average network utilization rate, the current disk utilization rate, the current memory utilization rate and the like. In the mode that the access interface module directly takes over the task distribution and result summarizing processing logic, the read-write data flow does not need to pass through the scheduling management node, and the task distribution and result summarizing processing logic is transferred and uniformly distributed to each access interface module by the scheduling management node. 
Centralized control over data reading and writing is weakened in exchange for stronger read-write concurrency.
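The read path described above can be sketched as a scatter-gather routine inside the access interface module. Everything here is hypothetical scaffolding: replicas are plain dictionaries with `name`, `busy` and `execute` entries, the busy degree is an equally weighted average (the text lists the inputs but not the weights), and `ConnectionError` stands in for whatever failure the original client raises.

```python
from concurrent.futures import ThreadPoolExecutor

def busy_degree(cpu: float, net: float, disk: float, mem: float) -> float:
    # Equal weighting is an assumption; the text only names the four inputs.
    return (cpu + net + disk + mem) / 4

def send_with_failover(replicas, sub_request, report_abnormal):
    """Send one sub-request to the least busy replica, falling back on failure."""
    for replica in sorted(replicas, key=lambda r: r["busy"]):
        try:
            return replica["execute"](sub_request)
        except ConnectionError:
            # Report the abnormal library to the scheduling management node.
            report_abnormal(replica["name"])
    raise ConnectionError("no replica available")

def scatter_gather(groups, sub_requests, report_abnormal):
    """Dispatch one sub-request per sub-library group in parallel and collect results."""
    with ThreadPoolExecutor() as pool:
        futures = {k: pool.submit(send_with_failover, groups[k], req, report_abnormal)
                   for k, req in sub_requests.items()}
        return {k: f.result() for k, f in futures.items()}
```

The scheduling management node appears only through the `report_abnormal` callback, so the request and result data never pass through it. For writes in mirror mode, the replica list of each group would be restricted to the primary library.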
Step 5: data redundancy and synchronization.
As shown in FIG. 5 and FIG. 6, data redundancy and synchronization may adopt either the mirror copy mode or the multi-active mode. The mirror mode is the primary-standby mode, with one primary library and one or more standby libraries. Each database sub-library comprises a primary library and mirror libraries that form a group or a chain; a write request for a sub-library is first sent to its primary library, which receives the data and automatically synchronizes them to the corresponding mirror libraries. The primary library and the mirror libraries register with the scheduling management node at start-up and update their state in real time. When a database application initializes its connection, the scheduling management node allocates one library from each group: the primary library of the group for a write connection, and the library with the least pressure in the group for a read connection. In the multi-active mode, each database sub-library is split into several peer libraries that form a group, and all libraries in the group mirror one another. A write request for a sub-library can be sent to any library in the group; the receiving library automatically synchronizes the data to its peers. All peer libraries register with the scheduling management node at start-up and update their state in real time, and when a database application initializes its connection, the scheduling management node selects the library with the least pressure from each group. Metadata synchronization and data synchronization can flexibly adopt different modes.
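The allocation policy of the two modes can be sketched in a few lines. The `Library` record, its `role` labels and the numeric `pressure` scale are assumptions introduced for this example; the text does not define how pressure is measured.

```python
from dataclasses import dataclass

@dataclass
class Library:
    name: str
    role: str        # "primary" or "mirror" (mirror mode), "peer" (multi-active mode)
    pressure: float  # load reported by the library; the scale is an assumption

def allocate(group: list[Library], task: str) -> Library:
    """Pick one library from a sub-library group for a new connection."""
    primaries = [lib for lib in group if lib.role == "primary"]
    if task == "write" and primaries:
        # Mirror mode: writes must enter through the primary library.
        return primaries[0]
    # Reads in mirror mode, and any task in multi-active mode: least pressure wins.
    return min(group, key=lambda lib: lib.pressure)

mirror_group = [Library("lib0-primary", "primary", 0.9), Library("lib0-mirror", "mirror", 0.2)]
active_group = [Library("lib1-a", "peer", 0.7), Library("lib1-b", "peer", 0.3)]
```

With these groups, a write connection to `mirror_group` is given `lib0-primary` despite its higher pressure, while a read connection is given `lib0-mirror`; in `active_group` both kinds of connection go to `lib1-b`.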
Step 6: unifying the access interface.
As shown in FIG. 7, the unified access interface module provides transparent read-write access to the entire time sequence database system for various applications. It reuses the original system access interface module and merely wraps the original time sequence database access interface in one additional layer, so the communication mode and the communication protocol of the original database do not need to change. Each time the unified access interface creates a connection, it obtains from the scheduling management service the information of the sub-libraries to be connected in all the latest sub-library groups, together with the data distribution rules, and generates a sub-library information table (the table is bound to the connection and has the same life cycle), and it connects to all sub-library groups at the same time. Each time a data read or write task is executed, the measuring point data and value data are logically split by sub-library group according to the data distribution rule, the bottom layer calls the original database interface of each sub-library to process the data, and finally all results are organized, summarized and returned to the application.
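A thin wrapper of this kind might look like the sketch below. The `fetch_metadata` callable and the `write`/`read` methods of the per-library clients are hypothetical stand-ins for the scheduling management service and the original database interface; the CRC32-based fragment rule is likewise an assumption.

```python
import zlib
from collections import defaultdict

class UnifiedConnection:
    """Wraps the original single-library clients behind one connection object."""

    def __init__(self, fetch_metadata):
        meta = fetch_metadata()               # one call to the scheduling management service
        self.p_num = meta["p_num"]            # part of the data distribution rule
        self.table = dict(meta["libraries"])  # sub-library information table, lives with the connection

    def _fragment(self, point_name: str) -> int:
        return zlib.crc32(point_name.encode("utf-8")) % self.p_num

    def write(self, records):
        """records: iterable of (point_name, timestamp, value) tuples."""
        batches = defaultdict(list)
        for rec in records:
            batches[self._fragment(rec[0])].append(rec)
        for k, batch in batches.items():
            self.table[k].write(batch)        # original interface, one call per sub-library

    def read(self, point_names, start, end):
        by_frag = defaultdict(list)
        for name in point_names:
            by_frag[self._fragment(name)].append(name)
        rows = []
        for k, names in by_frag.items():
            rows.extend(self.table[k].read(names, start, end))
        # Summarize: merge the partial results by timestamp, then by point name.
        return sorted(rows, key=lambda r: (r[1], r[0]))
```

The application sees only `write` and `read`, exactly as it would with a single database, while the splitting and merging happen inside the wrapper. This sketch calls the sub-libraries sequentially for brevity; a real access layer would dispatch them in parallel.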
In this embodiment, a database sub-library deployment topology is designed according to the actual hardware conditions (such as the number of servers and the physical memory and CPUs of each server) and the expected data scale, and the IP information (IP address and port) of each scheduling management node and each data node is configured. A data distribution rule is then established. In the common case the original database maps measuring points one-to-one to equipment indexes, and measuring point names are unique, so the name can be used directly as the measuring point attribute characteristic value. The data distribution function then degenerates to k = p(point_name, P_num), and all data of a single measuring point are distributed to one sub-library. An actual data distribution function may look like the following: k = HASH_fun(point_name) MOD HASH_MOD_BASE MOD P_num, where HASH_fun is a custom string hash function, hash pre-allocation is performed by taking the result modulo HASH_MOD_BASE, and HASH_MOD_BASE is a prime number such as 4095, which effectively prevents hot-spot allocation when measuring point names follow a regular pattern.
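A minimal sketch of such a distribution function is given below, under stated assumptions. The patent does not specify HASH_fun, so CRC-32 from the Python standard library is used purely as a stand-in. Note also that 4095 = 3² · 5 · 7 · 13 is composite, so the sketch assumes 4093, the nearest smaller prime, for HASH_MOD_BASE; the value of `P_NUM` and the point-name pattern are likewise illustrative.

```python
import zlib

HASH_MOD_BASE = 4093  # assumption: nearest prime below the 4095 quoted in the text
P_NUM = 8             # assumption: configured number of data fragments


def hash_fun(point_name):
    """Stand-in for the custom string hash; CRC-32 is used only for illustration."""
    return zlib.crc32(point_name.encode("utf-8"))


def shard_of(point_name, p_num=P_NUM):
    """k = HASH_fun(point_name) MOD HASH_MOD_BASE MOD P_num."""
    return hash_fun(point_name) % HASH_MOD_BASE % p_num


if __name__ == "__main__":
    # Regularly patterned names, the case the prime modulus is meant to handle.
    counts = [0] * P_NUM
    for i in range(10_000):
        counts[shard_of(f"meter_{i:06d}.voltage")] += 1
    print(counts)  # number of measuring points per fragment
```

Because the function depends only on the point name and `P_num`, every sample of a given measuring point lands in the same sub-library, which matches the degenerate form k = p(point_name, P_num).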
The startup configuration files of the scheduling management service and of each database sub-library are generated automatically from the distribution rule and the server topology, and the corresponding services (the scheduling management service and the database sub-library services) are installed and started. The scheduling management service may be a single node or a cluster, and the consistency of metadata within the scheduling management cluster can be maintained through ZooKeeper. The metadata takes the form of a map: the key is the identifier of a sub-library, and the value is the sub-library metadata, which includes the open connection address of the sub-library, the data range it covers, its physical node location, its real-time state, related statistical information and the like. The real-time state and the statistical information are provided when the sub-library service registers at startup and must be actively updated at regular intervals. Because the metadata volume is small and changes are infrequent, very few resources are actually consumed even though strong consistency must be maintained in real time. Data redundancy is handled in either mirror mode or dual-active (multi-active) mode. The mirror mode is generally more suitable for metadata because it ensures the physical consistency of the metadata, whereas the dual-active mode for data synchronization improves the performance and availability of the cluster. If the original single-machine time sequence database already has a data mirroring capability, that capability can be reused directly; in addition, when a mirror sub-library starts and registers, it must also report its correspondence with the primary sub-library to the scheduling management service.
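The metadata map and the register-then-update cycle can be illustrated with a small in-memory sketch. It deliberately does not model ZooKeeper or cluster consistency; the field names, the `timeout_s` staleness threshold and the `expire` helper are assumptions made for the example, not details taken from the patent.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SubLibraryMeta:
    address: str                      # open connection address, e.g. "10.0.0.5:6001"
    fragment: int                     # data range, reduced here to a fragment number
    node: str                         # physical node location
    status: str = "online"            # real-time state
    stats: dict = field(default_factory=dict)
    primary_id: Optional[str] = None  # set by a mirror to name its primary sub-library
    last_seen: float = 0.0


class MetadataRegistry:
    """Map of sub-library identifier -> metadata, as kept by scheduling management."""

    def __init__(self, timeout_s=10.0):
        self.entries = {}
        self.timeout_s = timeout_s    # assumed staleness threshold

    def register(self, sub_id, meta, now=None):
        meta.last_seen = time.time() if now is None else now
        self.entries[sub_id] = meta

    def heartbeat(self, sub_id, stats, now=None):
        # Periodic active update of state and statistics by the sub-library.
        meta = self.entries[sub_id]
        meta.stats = stats
        meta.status = "online"
        meta.last_seen = time.time() if now is None else now

    def expire(self, now=None):
        # Mark sub-libraries offline when their updates stop arriving.
        now = time.time() if now is None else now
        stale = []
        for sub_id, meta in self.entries.items():
            if meta.status == "online" and now - meta.last_seen > self.timeout_s:
                meta.status = "offline"
                stale.append(sub_id)
        return stale
```

A mirror sub-library would register with `primary_id` set to the identifier of its primary, which is one way to carry the correspondence that the text says must be reported at registration.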
A query on a data fragment is routed to the primary fragment or a mirror fragment, whichever has the lower load according to load information collected in real time. The access interface module is encapsulated to provide unified external access to the whole database cluster. Inside the module, meta-information is obtained from the scheduling management cluster in real time, and the bottom layer connects to each sub-library through the database interface of the original time sequence database. For data reads and writes, the access interface logically splits requests and data according to the data distribution rule in the meta-information, selects a suitable sub-library in each sub-library group according to the real-time sub-library state, and distributes the sub-requests; all requests to the sub-libraries still use the original database interface.
Table 1 database server hardware environment configuration
Figure BDA0003994364760000091
Table 2 Measuring point scale of the sub-libraries in each scenario
Figure BDA0003994364760000092
Table 3 Performance verification in each scenario
Scenario                                         Efficiency (ten thousand/second)
Residential electricity data write               7008.32
Residential electricity historical data query    7116.48
Daily freeze data write                          4923.2
Daily freeze latest value query                  10666.56
Take as an example the storage and computation of measurement data for the intelligent energy, power distribution cloud master station, integrated line loss and online national network services of an Anhui electric power company. A centrally deployed time sequence database was originally used, in which the measuring point scale did not exceed 1000 thousand, real-time data read performance did not exceed 200 thousand events/second, historical data read performance did not exceed 50 thousand events/second, and data write performance did not exceed 300 thousand events/second. After the centralized time sequence database was expanded in a distributed manner using this distributed expansion scheme, acceptance tests showed that the requirements on actual measuring point scale and data read-write performance were met. The physical deployment of the specific servers, the sub-library deployment and the logical deployment of the measuring points are shown in tables 1 to 3.

Claims (6)

1. A universal distributed expansion method for an industrial time series database is characterized by comprising the following steps:
(1) Constructing a distributed real-time database system comprising a database system access interface module, a database scheduling and managing module and a data calculation and storage module;
(2) Determining a data distribution rule: measuring points or time are used as distribution dimensions, so as to maximize the concurrency of task processing and reduce the complexity of data transmission and summarization;
(3) Sub-library management: the database scheduling and management module handles database information management, database state monitoring, load balancing and automatic scheduling of database resources, and takes into account the data migration caused by physical node changes;
(4) Transferring the task distribution and result summarization logic from the scheduling management node and distributing it evenly to each access interface module;
(5) Data redundancy and synchronization: system reliability is ensured by means of data copies, and if the original time sequence database has a mirroring function, the primary-standby mirror mode is selected;
(6) Unifying the database system access interface module, which provides transparent read-write access to the whole time sequence database system for various applications.
2. The universal distributed expansion method for the industrial time series database according to claim 1, characterized in that the database system access interface module provides transparent read-write access to the whole database system for various applications; the database scheduling and managing module defines a rule of data distribution, is distributed in one or more nodes, is a manager of the whole time sequence database system, monitors and manages the state and task scheduling of all the databases in real time, and provides the meta information to the database access interface module; the data calculation and storage module consists of a plurality of database sub-databases, each database sub-database provides independent calculation and storage services, and each sub-database receives or executes decomposed data or query tasks.
3. The method for universal distributed expansion of industrial time series databases as claimed in claim 1, wherein said step (2) comprises the steps of:
(21) Extracting a mapping relation between a measuring point and equipment according to a data model of an original single-machine database;
(22) Extracting attribute characteristic values of the measuring points: a coding rule f is defined such that, for a measuring point A, some of its attributes are converted through f into f(A), the attribute characteristic value of measuring point A; the invariance and uniqueness of the characteristic value f(A) must be ensured, where invariance means that f(A) does not change as the data collected by measuring point A change, and uniqueness means that for any measuring point A and any measuring point B, if A ≠ B, then f(A) ≠ f(B);
(23) Determining the relevant distribution dimensions according to the main data, the dimensions comprising the measuring point attribute characteristic value and time;
(24) Designing a distribution rule, namely defining a distribution function k = p(x, y, z), wherein x is the attribute characteristic value f(A) of the measuring point, y is the time point corresponding to the data, z is the currently configured number of data fragments, and k is the result of applying the distribution function to the data of a given measuring point at a given moment and represents the number of the data fragment to which the data belong.
4. The method for universal distributed expansion of industrial time series databases according to claim 1, wherein the step (3) is implemented as follows:
a single sub-library is located on exactly one physical node in the cluster, while one physical node comprises a plurality of sub-libraries; a single sub-library processes the data of only one data fragment and is a relatively independent working unit with its own running threads, cache space and persistent storage path;
the number of sub-libraries P_num is larger than the number of physical data nodes Node_Num; the maximum number of sub-libraries Num_perNode that a single physical node can support is calculated from the resource usage of the original single library at run time, and the maximum value of P_num is the product of Node_Num and Num_perNode;
as the total data volume grows and the physical data nodes are expanded, P_num is kept unchanged, and the granularity of data migration is the whole data fragment.
5. The method for universal distributed expansion of industrial time series databases according to claim 1, wherein the step (5) is implemented as follows:
data redundancy adopts one of two modes, mirror copy or multi-active: the mirror mode is a primary-standby mode with one primary library and one or more standby libraries, in which each database sub-library comprises a primary library and its mirror libraries forming a group or a chain, a write request for a sub-library is first sent to the primary library of that sub-library, and the primary library receives the data and automatically synchronizes it to the corresponding mirror libraries; when the primary library and the mirror libraries start, they register with the scheduling management node and update their state in real time, and when a database application initializes its connection, the scheduling management node allocates one library from each group, selecting the primary library of the group for a write task and the library with the lowest load in the group for a read task;
in the multi-active mode, each database sub-library is divided into several peer libraries that form a group, all libraries in the group mirror one another, a write request for a sub-library can be sent to any library in its group, and the receiving library accepts the data and automatically synchronizes it to its peers; all peer libraries register with the scheduling management node at startup and update their state in real time, and when a database connection is initialized the scheduling management node selects the library with the lowest load from each group.
6. The method for universal distributed expansion of industrial time series databases as claimed in claim 1, wherein the step (6) is implemented as follows:
each time the unified database system access interface creates a connection, it obtains from the scheduling management service the latest information on the sub-libraries to be connected in every sub-library group together with the data distribution rule, generates a sub-library information table, and connects to all sub-library groups at the same time; each time a data read or write task is executed, the measuring point data and value data are logically split by sub-library group according to the data distribution rule, the bottom layer calls the original database interface for each group to process the data, and finally all results are collected, summarized and returned to the application.
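The sizing and migration rules of claim 4 can be illustrated with a short sketch. It assumes a simple round-robin initial placement and a greedy rebalancing loop, neither of which is specified in the claims; the function names are hypothetical, and the sketch handles node additions only.

```python
def max_fragments(node_num, num_per_node):
    """Upper bound on P_num: Node_Num multiplied by Num_perNode."""
    return node_num * num_per_node


def place(p_num, nodes):
    """Assumed initial placement: fragments assigned to nodes round-robin."""
    return {k: nodes[k % len(nodes)] for k in range(p_num)}


def rebalance(placement, nodes):
    """Move whole fragments between nodes until loads differ by at most one.

    P_num never changes; only the fragment-to-node mapping does.
    Every node already present in `placement` must also appear in `nodes`.
    """
    load = {n: [] for n in nodes}
    for k, n in sorted(placement.items()):
        load[n].append(k)
    moves = []
    while True:
        busiest = max(nodes, key=lambda n: len(load[n]))
        idlest = min(nodes, key=lambda n: len(load[n]))
        if len(load[busiest]) - len(load[idlest]) <= 1:
            return moves
        k = load[busiest].pop()
        load[idlest].append(k)
        placement[k] = idlest             # the whole fragment migrates
        moves.append((k, busiest, idlest))
```

Because the fragment count stays fixed, the distribution function never has to be recomputed when nodes are added; only the moved fragments are copied, each as a complete unit.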
CN202211591049.XA 2022-12-12 2022-12-12 Universal distributed expansion method for industrial time sequence database Pending CN115934819A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211591049.XA CN115934819A (en) 2022-12-12 2022-12-12 Universal distributed expansion method for industrial time sequence database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211591049.XA CN115934819A (en) 2022-12-12 2022-12-12 Universal distributed expansion method for industrial time sequence database

Publications (1)

Publication Number Publication Date
CN115934819A true CN115934819A (en) 2023-04-07

Family

ID=86700427

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211591049.XA Pending CN115934819A (en) 2022-12-12 2022-12-12 Universal distributed expansion method for industrial time sequence database

Country Status (1)

Country Link
CN (1) CN115934819A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117093643A (en) * 2023-10-18 2023-11-21 中国长江电力股份有限公司 Industrial rule platform system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117093643A (en) * 2023-10-18 2023-11-21 中国长江电力股份有限公司 Industrial rule platform system
CN117093643B (en) * 2023-10-18 2024-01-05 中国长江电力股份有限公司 Industrial rule platform system

Similar Documents

Publication Publication Date Title
US8335769B2 (en) Executing replication requests for objects in a distributed storage system
CN111327681A (en) Cloud computing data platform construction method based on Kubernetes
US7590672B2 (en) Identification of fixed content objects in a distributed fixed content storage system
CN104965850B (en) A kind of database high availability implementation method based on open source technology
CN109669929A (en) Method for storing real-time data and system based on distributed parallel database
CN109933631A (en) Distributed parallel database system and data processing method based on Infiniband network
CN112199427A (en) Data processing method and system
CN103905537A (en) System for managing industry real-time data storage in distributed environment
CN113407600B (en) Enhanced real-time calculation method for dynamically synchronizing multi-source large table data in real time
CN110245134B (en) Increment synchronization method applied to search service
CN109815294A (en) A kind of dereliction Node distribution parallel data storage method and system
CN109150964B (en) Migratable data management method and service migration method
CN101419600A (en) Data copy mapping method and device based on object-oriented LANGUAGE
CN115934819A (en) Universal distributed expansion method for industrial time sequence database
CN113868335A (en) Method and equipment for expanding distributed clusters of memory database
CN115587118A (en) Task data dimension table association processing method and device and electronic equipment
CN116701330A (en) Logistics information sharing method, device, equipment and storage medium
CN107908713B (en) Distributed dynamic rhododendron filtering system based on Redis cluster and filtering method thereof
US11449521B2 (en) Database management system
CN107276914B (en) Self-service resource allocation scheduling method based on CMDB
CN117271583A (en) System and method for optimizing big data query
Wang et al. Block storage optimization and parallel data processing and analysis of product big data based on the hadoop platform
CN115587147A (en) Data processing method and system
CN111258977A (en) Tax big data storage and analysis platform
CN110569310A (en) Management method of relational big data in cloud computing environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination