CN113407600A

CN113407600A - Enhanced real-time calculation method for dynamically synchronizing multi-source large table data in real time

Info

Publication number: CN113407600A
Application number: CN202110947193.1A
Authority: CN
Inventors: 刘军华; 吴名朝
Original assignee: Whale Cloud Technology Co Ltd
Current assignee: Whale Cloud Technology Co Ltd
Priority date: 2021-08-18
Filing date: 2021-08-18
Publication date: 2021-09-17
Anticipated expiration: 2041-08-18
Also published as: CN113407600B

Abstract

The invention discloses an enhanced real-time computing method for dynamically synchronizing multi-source large table data in real time, comprising the following steps: building a distributed dynamic table management component; creating a distributed dynamic table and standardizing the data structure of the multi-source tables through it; managing the synchronous update of metadata change information of the distributed dynamic table through a real-time synchronization technology; initializing the distributed dynamic table data by batch import; checking the integrity of the imported data; updating the data in the distributed dynamic table in real time; monitoring the real-time data during synchronous updating; converting the data in the distributed dynamic table into a virtual table of real-time stream data through a structured query language; performing combined stream calculation on the data in the virtual table and pre-configured stream data; and outputting the calculation result. The beneficial effect is that the method meets the real-time computing requirements of complex business logic when real-time computing is combined with external data sources holding tables of more than one hundred million records.

Description

Enhanced real-time calculation method for dynamically synchronizing multi-source large table data in real time

Technical Field

The invention relates to the technical field of big data calculation, in particular to an enhanced real-time calculation method for dynamically synchronizing multi-source big table data in real time.

Background

In real-time computing, many real-time services have complex processing requirements such as data completion, data filtering, and data conversion. The data needed by this logic (completion values, filter conditions, conversion rules) is usually stored outside the real-time computing process, so the process has to connect to different external data sources and pull the data. These sources often contain large tables with hundreds of millions of records, and the performance of the pull depends entirely on the performance of the source-end data source, which has a large impact on the performance of real-time computing.

With the popularization of real-time computing services, telecom operators and enterprises have increasingly rich requirements for real-time computing scenarios, and many scenarios require real-time data to be computed in combination with data in existing offline database tables. One such scenario is the completion of real-time data. For example, given the real-time data record pkey = 1, fkey = 101, udata = 200, the value udatetype = 'sensitive data' has to be queried from the offline base table according to fkey = 101, and the final real-time output record is:

pkey = 1, fkey = 101, udata = 200, udatetype = 'sensitive data';
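The completion scenario above can be sketched in Python as follows. This is only an illustration: the dictionary `offline_table` stands in for the offline base table indexed by fkey, and the function name `complete_record` is hypothetical; only the field names and values come from the example.

```python
# Stand-in for the offline base table, indexed by fkey (assumed layout).
offline_table = {101: {"udatetype": "sensitive data"}}


def complete_record(record, table):
    """Return a copy of the real-time record completed with udatetype."""
    row = table.get(record["fkey"])
    completed = dict(record)
    # Leave udatetype as None when the base table has no matching row.
    completed["udatetype"] = row["udatetype"] if row else None
    return completed


realtime_record = {"pkey": 1, "fkey": 101, "udata": 200}
output = complete_record(realtime_record, offline_table)
```

The input record is left unchanged, and a missing fkey yields `None` rather than an error; both choices are assumptions of the sketch.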

the existing technologies for implementing this service scenario mainly include:

1) if the offline base table is a structured database such as MySQL, Oracle, PostgreSQL, or Hive, it can be connected through the standard JDBC/ODBC interfaces and the table data can be read on demand with SQL, for example: select udatetype from db.t1 where fkey = 101, where udatetype is the field to read, db.t1 is the library and table, and fkey is the index field used as the query condition; the value of the udatetype field is the completion data;

2) if the offline base table is a semi-structured or unstructured store such as HBase, HDFS, or ClickHouse, the data is read in real time through the API of the connection mode that the store provides, and the real-time computation is then carried out;

3) loading the data of the offline table into a specified memory in advance, and reading the data from the memory during calculation to calculate or complement the data;

to meet the second-level performance requirement of real-time calculation, the prior art mostly depends on the processing capacity of the different types of databases, and the combined calculation of real-time data and offline data mainly has the following defects:

1) if 100 or more real-time data records are processed simultaneously, the offline database has to be connected 100 or more times, which puts great access pressure on the offline database;

2) the second-level performance requirement of real-time calculation is difficult to meet: because databases differ, it is hard to guarantee a second-level response when reading records from tables of more than one hundred million rows, so real-time calculation is delayed and a large amount of real-time data accumulates;

3) when data is read from memory, its correctness is difficult to guarantee: tables of more than one hundred million rows cannot be loaded into memory completely, and because the data is preloaded, changes to the offline table cannot be synchronized to memory in time, which can cause calculation errors;

4) the formats of multi-source data are not uniform, which increases the difficulty of integration: because data sources such as HBase and Oracle have different storage formats, a separate operator has to be developed for each data source when combining with real-time data, which increases development cost and implementation complexity.

An effective solution to the problems in the related art has not been proposed yet.

Disclosure of Invention

In the field of real-time computing, many services have to be computed together with data in multi-source offline base tables. These tables are relatively large (a single Oracle or HBase table can hold hundreds of millions of records), so the performance of reading them directly is difficult to guarantee. To meet the second-level performance requirement of real-time computing while avoiding excessive access pressure on the base tables, and in view of the problems in the related art, the invention designs a distributed dynamic table that synchronously loads the data of the offline base tables and standardizes the interface to the multi-source base tables, thereby meeting the accuracy and performance requirements of real-time computing and solving the technical problems in the existing related art.

Therefore, the invention adopts the following specific technical scheme:

an enhanced real-time computing method for dynamically synchronizing multi-source large table data in real time comprises the following steps:

s1, building a distributed dynamic table management component through a single-node storage engine based on a virtual node consistency hash algorithm and a structured query language;

s2, creating a distributed dynamic table in the distributed dynamic table management component through the structured query language, and standardizing the data structure of the multi-source table through the distributed dynamic table;

s3, managing the synchronous update of the metadata change information of the distributed dynamic table through a real-time synchronization technology;

s4, initializing batch import distributed dynamic table data;

s5, checking the integrity of the imported data;

s6, updating the data in the distributed dynamic table in real time;

s7, carrying out synchronous monitoring on real-time data in the synchronous updating process of the data;

s8, converting the data in the distributed dynamic table into a virtual table of real-time stream data through a structured query language;

s9, performing combined flow calculation on the data in the virtual table and the pre-configured flow data;

and S10, outputting the calculation result.

Further, the building of the distributed dynamic table management component through the single-node storage engine based on the virtual node consistency hash algorithm and the structured query language further comprises the following steps:

s11, realizing the balanced distribution storage of the data based on the virtual node consistency hash algorithm;

and S12, analyzing the structured query language, converting the structured query language into an information code capable of acquiring the base table, and reading and writing data through a distribution algorithm and an application program interface of the single-node storage engine.

Further, the implementation of the balanced distribution storage of data based on the virtual node consistent hash algorithm further includes the following steps:

s111, abstracting the whole hash space into a virtual ring;

s112, when the access routing is carried out on the value of the hash function, the value is firstly routed to a virtual node, and then the virtual node finds a real node;

s113, virtualizing P physical nodes on the virtual ring, virtualizing N virtual nodes from each physical node, and randomly mapping the total virtual nodes to the virtual ring;

s114, storing and acquiring data;

the total number formula of the virtual nodes is as follows:

the total number of virtual nodes (M) = the number of physical nodes (P) × the number of virtual nodes per physical node (N).

Further, the managing the synchronous update of the metadata change information of the distributed dynamic table by the real-time synchronization technology further includes the following steps:

s31, changing fields;

s32, changing the primary key and the index field;

wherein the field change comprises a field code and a field data type change.

Further, the initializing the batch import distributed dynamic table data further includes the following steps:

s41, according to the obtained library table name, obtaining the metadata information of the library table from the memory;

s42, correspondingly setting a plurality of domain values into a hash table by using a library entry statement of the distributed dynamic table management component;

and S43, saving the data by using the partition strategy and forming a table data storage format.

Further, the checking the integrity of the imported data further comprises the following steps:

s51, providing batch import data, and recording the total successful records and the total failed records;

and S52, acquiring the record number of the reading source table, and comparing the record number with the imported data record.
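Steps S51 and S52 can be illustrated with a minimal Python sketch. It assumes the dynamic table is modelled as a dictionary keyed by primary key and that a row without a usable primary key counts as a failed record; the names `ImportReport`, `batch_import`, and `is_complete` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ImportReport:
    succeeded: int = 0  # total successfully imported records (S51)
    failed: int = 0     # total failed records (S51)


def batch_import(rows, store):
    """Import rows into `store` (a dict here) and count successes and failures."""
    report = ImportReport()
    for row in rows:
        try:
            store[row["pkey"]] = row
            report.succeeded += 1
        except (KeyError, TypeError):
            # A row without a usable primary key counts as a failed record.
            report.failed += 1
    return report


def is_complete(source_count, report):
    """S52: the import is complete when every source record was imported."""
    return report.failed == 0 and report.succeeded == source_count


source_rows = [{"pkey": 1, "v": "a"}, {"pkey": 2, "v": "b"}, {"v": "no key"}]
dynamic_table = {}
report = batch_import(source_rows, dynamic_table)
```

Here `is_complete(3, report)` is false because one of the three source rows failed, which is the signal that the import needs to be repaired or repeated.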

Further, the real-time updating of the data in the distributed dynamic table further includes the following steps:

s61, connecting the database to obtain the position successfully analyzed last time;

s62, establishing connection with a database, sending a DUMP protocol instruction, and simulating a mode of synchronous data;

s63, the offline data starts to push the update log data of newly added data, modified data and deleted data;

s64, the real data analyzer receives the log data, calls a corresponding log analyzer according to the log type to analyze the protocol and supplement related information, and distributes and sends the analyzed data to a corresponding real-time data receiver through a base table and the hash value of the main key;

s65, the real-time data receiver receives the data and stores the data;

s66, guaranteeing the data by using the transaction mechanism of the real-time data receiver;

and S67, after the data updating and storing are successful, updating the position data of the log.
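The flow of steps S61 to S67 can be sketched roughly as below. This is a simplified model under stated assumptions, not the patented implementation: change events are plain dictionaries with a monotonically increasing `pos`, receivers are in-memory dictionaries, md5 is used as the routing hash, and the transaction mechanism of S66 is not modelled. The names `route`, `apply_event`, and `sync` are hypothetical.

```python
import hashlib


def route(table, pkey, receiver_count):
    """Pick a receiver from a hash of the base table name and primary key (S64)."""
    digest = hashlib.md5(f"{table}:{pkey}".encode()).hexdigest()
    return int(digest, 16) % receiver_count


def apply_event(store, event):
    """Apply one parsed change event to a receiver's local store (S65)."""
    key = (event["table"], event["pkey"])
    if event["op"] == "DELETE":
        store.pop(key, None)
    else:  # INSERT and UPDATE both write the latest row image
        store[key] = event["row"]


def sync(events, receivers, last_position):
    """Apply events after last_position; return the new log position (S67)."""
    position = last_position
    for event in events:
        if event["pos"] <= last_position:
            continue  # already applied before the last checkpoint (S61)
        idx = route(event["table"], event["pkey"], len(receivers))
        apply_event(receivers[idx], event)
        position = event["pos"]  # advance only after a successful store
    return position


receivers = [{}, {}]
events = [
    {"pos": 1, "op": "INSERT", "table": "t1", "pkey": 101, "row": {"udatetype": "a"}},
    {"pos": 2, "op": "UPDATE", "table": "t1", "pkey": 101, "row": {"udatetype": "b"}},
    {"pos": 3, "op": "INSERT", "table": "t1", "pkey": 102, "row": {"udatetype": "c"}},
    {"pos": 4, "op": "DELETE", "table": "t1", "pkey": 102, "row": None},
]
new_position = sync(events, receivers, last_position=0)
merged = {k: v for r in receivers for k, v in r.items()}
```

Because the same table and primary key always hash to the same receiver, all changes to one row are applied in log order by a single receiver, and replaying the log from the stored position does not apply an event twice.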

Further, the synchronous monitoring of the real-time data in the synchronous data updating process further includes the following steps:

s71, providing the total number of the records which are changed in real time and written successfully and the total number of the failed records, and acquiring the total number of the records according to the time period;

s72, recording the number of records which change in real time, and acquiring the total number of records according to the time period;

and S73, providing a data re-supplement capability that resends the messages whose write failed.
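A minimal sketch of the monitoring in steps S71 to S73 follows. It assumes that counts are kept per fixed-length time bucket and that failed messages are retained in memory for resending; the class `SyncMonitor` and the 60-second bucket length are illustrative choices, not part of the original text.

```python
from collections import defaultdict


class SyncMonitor:
    """Counts successful and failed writes per time bucket and keeps failures for replay."""

    def __init__(self, bucket_seconds=60):
        self.bucket_seconds = bucket_seconds
        self.success = defaultdict(int)
        self.failure = defaultdict(int)
        self.failed_messages = []

    def _bucket(self, timestamp):
        return int(timestamp // self.bucket_seconds)

    def record(self, timestamp, message, ok):
        if ok:
            self.success[self._bucket(timestamp)] += 1
        else:
            self.failure[self._bucket(timestamp)] += 1
            self.failed_messages.append(message)

    def totals(self, start, end):
        """Return (successes, failures) for the buckets covering [start, end]."""
        buckets = range(self._bucket(start), self._bucket(end) + 1)
        return (sum(self.success.get(b, 0) for b in buckets),
                sum(self.failure.get(b, 0) for b in buckets))

    def replay(self, write):
        """S73: resend failed messages; keep only those that fail again."""
        self.failed_messages = [m for m in self.failed_messages if not write(m)]


monitor = SyncMonitor(bucket_seconds=60)
monitor.record(5, "m1", True)
monitor.record(30, "m2", False)
monitor.record(70, "m3", True)
first_minute = monitor.totals(0, 59)

resent = []


def write(message):
    resent.append(message)
    return True


monitor.replay(write)
```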

Further, the virtual table for converting the data in the distributed dynamic table into the real-time stream data through the structured query language further comprises the following steps:

s81, writing and reading a structured query language of the distributed dynamic table;

s82, analyzing and checking the compliance of the structured query language through a parser and a verifier;

s83, splitting the structured query language into operators with codes capable of being written, identifying distributed dynamic tables and related field information, analyzing fields and values, and performing fuzzy query according to a table data storage format;

and S84, ensuring the consistency of the virtual table name and the distributed dynamic table name through an application program interface and a transaction mechanism, and realizing the conversion of the data of the distributed dynamic table into stream data.

Further, the combining stream calculation of the data in the virtual table and the preconfigured stream data further includes the following steps:

s91, reading data of the virtual table and the flow table by using a structured query language of the real-time computing engine;

s92, loading the data of the virtual table as required, analyzing the condition of loading the data of the virtual table, and loading the data from the distributed dynamic table to the memory of the real-time computing engine in an asynchronous multithreading mode according to the condition;

s93, in the virtual table data loading process, starting an asynchronous multi-concurrency mode to transmit data, and dynamically distributing transmitted concurrency number according to the data size;

and S94, splitting into concurrent fetching ends according to the table grouping, the data reading range and the idle condition of the cluster resources during loading, and performing concurrent data reading.
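Steps S92 to S94 describe on-demand, concurrent loading. The sketch below shows one way this could look, assuming the dynamic table is an in-memory list, the load condition is a Python predicate, and the concurrency is chosen from the row count alone (the original also considers table grouping and idle cluster resources, which are not modelled). All function names and the thresholds `rows_per_worker` and `max_workers` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def choose_concurrency(row_count, rows_per_worker=1000, max_workers=8):
    """Scale the number of readers with the data size (assumed policy)."""
    needed = -(-row_count // rows_per_worker)  # ceiling division
    return max(1, min(max_workers, needed))


def split_ranges(row_count, parts):
    """Split [0, row_count) into at most `parts` contiguous read ranges."""
    if row_count == 0:
        return []
    step = -(-row_count // parts)
    return [(lo, min(lo + step, row_count)) for lo in range(0, row_count, step)]


def load_virtual_table(table, predicate):
    """Load only rows matching `predicate`, reading the ranges concurrently."""
    workers = choose_concurrency(len(table))
    ranges = split_ranges(len(table), workers)

    def read(bounds):
        lo, hi = bounds
        return [row for row in table[lo:hi] if predicate(row)]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = list(pool.map(read, ranges))  # results keep range order
    return [row for chunk in chunks for row in chunk]


table = [{"fkey": i, "udatetype": "t"} for i in range(2500)]
loaded = load_virtual_table(table, lambda row: row["fkey"] % 500 == 0)
```

Only the rows that satisfy the condition reach the memory of the computing side, which is the sense in which a large table is reduced to a small one on demand.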

The invention has the beneficial effects that:

1. By combining and optimizing current open-source components and applying them at the appropriate stages, large table data is loaded layer by layer and on demand, so that a large table is effectively split into small tables as needed while the efficiency of memory is fully used. The invention provides a storage component for large table data, standardizes the structure of multi-source table data, accelerates real-time calculation, and effectively solves the performance pain point of real-time services computing against large stored tables. Source-end data is synchronized in batch mode and through a push protocol, which avoids access pressure on the source-end base tables. The method can meet the requirements of most service scenarios that combine real-time streams with offline data, widens the range of services supported by real-time computing, and meets the real-time computing requirements of complex service logic when real-time computing is combined with external data sources holding tables of more than one hundred million records.

2. Based on the current mainstream real-time computing engine Flink, the real-time performance requirement can still be met when stream data is computed in combination with large and small table data from multi-source external data sources, and complex service processing with high concurrency and high throughput is supported. The invention mainly addresses the second-level performance requirement when stream computing is combined with large tables of external data sources, which mainly include big data components such as HBase, Hive, and HDFS, relational databases such as MySQL and Oracle, and non-structured databases such as ES. In addition, when the combined computation needs the incremental data of the external source, that incremental data can be synchronized in real time and take part in the computation, which ensures both performance and correctness in complex real-time stream service computing.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow diagram of a method for enhanced real-time computation of dynamic real-time synchronization of multi-source large table data according to an embodiment of the present invention;

FIG. 2 is a diagram of a flow calculation process of a distributed dynamic table in an enhanced real-time calculation method for dynamically synchronizing multi-source large table data in real time according to an embodiment of the present invention;

FIG. 3 is a consistent Hash diagram after a virtual node in an enhanced real-time computing method for dynamically synchronizing multi-source large table data in real time according to an embodiment of the invention;

FIG. 4 is a table data storage format diagram of an enhanced real-time computing method for dynamically synchronizing multi-source large table data in real time according to an embodiment of the present invention;

FIG. 5 is a diagram of an index storage format in a method for enhanced real-time computation of dynamic real-time synchronization of multi-source large table data according to an embodiment of the present invention;

FIG. 6 is a diagram illustrating a method for real-time updating a distributed dynamic table by modified data in an enhanced real-time calculation method for dynamically synchronizing multi-source large table data in real time according to an embodiment of the present invention;

FIG. 7 is a diagram illustrating a process of converting dynamic table data into stream data in an enhanced real-time computing method for dynamically synchronizing multi-source large table data in real time according to an embodiment of the present invention;

FIG. 8 is a diagram illustrating a process of calculating flow data and virtual tables in an enhanced real-time calculation method for dynamically synchronizing multi-source large table data in real time according to an embodiment of the present invention;

FIG. 9 is a performance diagram of a distributed dynamic table in an enhanced real-time computing method for dynamically synchronizing multi-source large table data in real time according to an embodiment of the present invention.

Detailed Description

To further explain the embodiments, the invention provides drawings, which form part of the disclosure, mainly illustrate the embodiments, and together with the description explain the principles of operation of the embodiments. With reference to them, those of ordinary skill in the art will understand other possible embodiments and the advantages of the invention. Elements in the drawings are not drawn to scale, and like reference numerals generally refer to like elements.

According to the embodiment of the invention, an enhanced real-time computing method for dynamically synchronizing multi-source large table data in real time is provided.

The invention is further explained with reference to the drawings and the detailed description. As shown in FIGS. 1-2, an enhanced real-time computing method for dynamically synchronizing multi-source large table data in real time according to an embodiment of the invention includes the following steps:

specifically, in the distributed dynamic table management component, a distributed dynamic table is created through a simple SQL (structured query language) statement, and the data structures of the multi-source library tables are standardized through the dynamic table. A distributed dynamic table can be created to interface with tables in many types of offline libraries, such as Oracle, PostgreSQL, GoldenDB, oceanddb, TiDB, and ClickHouse. The statement for creating a dynamic table is, for example:

create table rdm_[library name][dynamic table name] (field name type, field name type, ..., PRIMARY KEY (primary key));

with (dbtype='oracle', db='', table='', dataSourceCode='');

the table structure data is stored in MySQL storage media, wherein,

rdm_: a fixed prefix indicating that the table is a distributed dynamic table;

PRIMARY KEY: specifies the primary key of the table;

dbtype: the type of database from which data is synchronized;

db: the name of the source library, mainly used to connect to libraries in multi-source scenarios; it may be omitted, in which case the system default value def is used;

table: the name of the source table; the dynamic table name should be kept consistent with the source table name as far as possible so that the correspondence between the two is easier to manage;

dataSourceCode: the code of the configured data source, through which the configured data source connection and related information can be looked up.

An index can be created on the fields used as query conditions to improve query performance, as follows:

create index on rdm_[library name][dynamic table name] (field name, field name, ...);

The metadata information is stored in the distributed dynamic table management component so that it is convenient to use in real-time data development, management, and operation and maintenance.

In addition, to standardize the data of multi-source base tables in real-time calculation while exploiting the characteristics of the distributed dynamic table, the requirements of second-level real-time queries and of storing tables with tens or hundreds of millions of rows have to be met. The design therefore uses fast distributed storage for metadata so that metadata can be loaded quickly. With the metadata stored in the distributed dynamic table management component, multi-source base table data can be used quickly in real-time task development, operation and maintenance, management, and audit.

To create metadata, the SQL create table statement is parsed using Call, and the metadata is then split into key-value form and stored in the RocksDB component, for example: put(rdm_[library name][dynamic table name], field set);

To reduce the write I/O between metadata and the storage device, two characteristics of metadata are considered:

1. Metadata information is relatively scattered and involves many random operations. Writing it directly to the hard disk performs worse than sequential writing, so random writes have to be turned into sequential writes: the distributed dynamic table management component uses a journaling technique that first writes metadata into a separate storage area, called the journal, before committing it to the back-end store. A random write is first written to the contiguous journal, which turns a random operation into a sequential one and improves performance.

2. Metadata is modified frequently. If every modification were written to the hard disk, performance would also suffer, so writes have to be merged: with the journaling technique, the changed data is written into an area of memory, and the dirty metadata corresponding to the journal is later flushed from memory to disk. Even if the metadata of a library table is modified 10 times in a row, it is flushed to disk only once, which improves flushing efficiency.
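The two journaling ideas above (sequential journal writes and write merging) can be illustrated with a toy Python model. It is a sketch under assumptions: the journal is an in-memory list, the back-end store is a dictionary, and the class name `JournaledMetadataStore` and the table name `rdm_db1_t1` are invented for the example.

```python
class JournaledMetadataStore:
    """Toy model: sequential journal plus in-memory dirty set merged on flush."""

    def __init__(self):
        self.journal = []        # append-only log, stands in for the journal area
        self.dirty = {}          # latest unflushed value per table
        self.backing = {}        # stands in for the back-end store
        self.backing_writes = 0  # number of writes that reached the back-end store

    def put(self, table, fields):
        # Every change is appended sequentially to the journal first.
        self.journal.append((table, fields))
        # Later changes to the same table overwrite the earlier dirty value.
        self.dirty[table] = fields

    def flush(self):
        # One back-end write per dirty table, however many times it changed.
        for table, fields in self.dirty.items():
            self.backing[table] = fields
            self.backing_writes += 1
        self.dirty.clear()
        self.journal.clear()  # flushed journal entries are no longer needed


store = JournaledMetadataStore()
for version in range(10):
    store.put("rdm_db1_t1", ["pkey", "fkey", f"col_v{version}"])
journal_entries = len(store.journal)
store.flush()
```

Ten modifications produce ten sequential journal entries but a single write to the back-end store, matching the 10-to-1 example in the text.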

specifically, when metadata such as fields, data types, primary keys and indexes of the multi-source table are changed, the distributed dynamic table management component can be timely notified to update the metadata information through a real-time synchronization technology (CDC), and meanwhile, the correctness of real-time data processing is guaranteed.

S4, initializing batch import distributed dynamic table data;

specifically, when a dynamic table has just been created it contains no data, so the data in the offline table has to be loaded into the dynamic table in batches. Batch loading usually takes place before real-time calculation starts. Because some tables are very large and batch import takes a long time, several strategies are provided to import large tables quickly, and the batch import can be configured as needed.

S5, checking the integrity of the imported data;

specifically, during batch synchronization, instability of the network, the source end, or the real-time computing end may cause data to be lost during initialization, so a function for checking the imported data is provided (see below for details).

S6, updating the data in the distributed dynamic table in real time;

specifically, when INSERT (newly added data), UPDATE (modified data), and DELETE (deleted data) operations occur on data in the source-end base table, the changed data in the offline base table is captured in real time through an asynchronous CDC (Change Data Capture) technique, and the captured data is synchronized to other databases (homogeneous or heterogeneous) or application systems. Here the synchronized data is mainly converted and stored for the distributed dynamic table, so that the data in the dynamic table is updated in time, and the correctness of the data is guaranteed through short transactions.

and S10, outputting the calculation result.

Specifically, an example shows the process of generating and using the distributed dynamic table according to the multi-source large table, and the example shows that the performance of the distributed dynamic table can completely meet the second-level performance requirement of real-time calculation on the multi-source large table.

In one embodiment, the building of the distributed dynamic table management component by the single-node storage engine based on the virtual node consistent hash algorithm and the structured query language further includes the following steps:

Specifically, a distributed dynamic table data storage component is built to manage and store the data of multi-source database types in a unified way. It provides a standard data access mode for real-time calculation, meets the read-write performance needs of real-time tasks, supports distributed scale-out and scale-in of storage, meets the storage requirements of tables with hundreds of millions of rows, and offers simple SQL operations so that the distributed dynamic table management component can be accessed and used quickly.

The distributed dynamic table management component is based on the RocksDB storage engine. RocksDB is an embedded key-value storage engine in which keys and values can be arbitrary byte streams without size limitation. Distributed technologies are added on top of it, such as RPC (remote procedure call), which is one of the most basic technologies in a distributed system and is used for communication between its nodes.

RocksDB is a single-node storage engine with good, efficient storage performance that can meet the second-level performance requirements of storing and computing real-time data. Drawing on the performance of the RocksDB storage engine and on research into current distributed storage system architectures and related technologies, a distributed dynamic table management component is designed and implemented that distributes storage evenly through a virtual node consistent hash algorithm, supports efficient reading and writing of data through simple SQL, and provides efficient, stable, and reliable persistent storage services.

Data is stored in a balanced way using a consistent hash data distribution strategy. The significance of consistency is that when nodes are added or removed, only the one or two nodes adjacent to the changed node are affected and the rest of the hash table remains unchanged. The hash function of consistent hashing is independent of the number of nodes, which makes it easy to scale nodes out and in and to store the data of large distributed tables.

In one embodiment, the implementing of the balanced distribution storage of data based on the virtual node consistent hashing algorithm further comprises the following steps:

s111, abstracting the whole hash space into a virtual ring;

specifically, assuming that the value space of the hash function is 0 to 2^32 - 1, namely a 32-bit unsigned integer, the value can be adjusted according to the actual deployment scale and defaults to 1024 (the total number of virtual nodes M).

specifically, as shown in fig. 3, A1, B1, E1, C2, A2, D1, C1, D2, B2, and E2 are all virtual nodes, and there are only five physical nodes, namely A, B, C, D, and E. Physical node A is responsible for storing the data on virtual nodes A1 and A2, physical node B for B1 and B2, physical node C for C1 and C2, physical node D for D1 and D2, and physical node E for E1 and E2.

specifically, a certain number of physical nodes P are virtualized on the virtual ring and uniformly distributed so that no avalanche occurs. Each virtual node corresponds to only one physical node, and each physical node is virtualized into N virtual nodes (N can be chosen according to the performance of the physical machine and defaults to 3); the virtual nodes are then mapped onto the ring at random.

The virtual nodes of each physical node are generated on the virtual ring by taking a hash modulo the virtual node total M, in a loop that runs N times:

loop (i = 0; i < N; i++) {

    Vcode = hash(md5(random code + hostName + i)) % M

}

Wherein:

Vcode: the value of a virtual node;

N: the number of virtual nodes of a single physical node;

M: the total number of virtual nodes;

i: an auto-increment variable;

random code: a number within the virtual node total M, generated by a random method;

hostName: the name of the physical node;

md5: hashes the virtual node through the md5 algorithm;

hash: takes the hash value of the hashed result.

The mapping between virtual nodes and physical nodes computed by the formula is stored in a hash table, and a version number is added to manage modifications. The hash table changes whenever data nodes are added to or removed from the cluster, so cached copies can become stale. For this reason the design attaches a version number to the hash table and updates it each time the table changes. When a client receives a version number, it compares it with the version of its locally stored hash table; if the local version is lower, the client actively requests the latest hash table from the control node and caches it locally.
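The generation loop and the versioned mapping table can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: `hash` is taken to be the integer value of the MD5 hex digest, collisions are resolved by simply retrying with the next i, and the names `vcode`, `build_vnode_table` and `needs_refresh` are hypothetical.

```python
import hashlib
import random

M = 1024  # total number of virtual nodes (default stated in the text)
N = 3     # virtual nodes per physical node (default stated in the text)


def vcode(random_code: int, host_name: str, i: int, m: int = M) -> int:
    """Vcode = hash(md5(random code + hostName + i)) % M.

    Assumption: 'hash' is read as the integer value of the MD5 digest.
    """
    digest = hashlib.md5(f"{random_code}{host_name}{i}".encode("utf-8")).hexdigest()
    return int(digest, 16) % m


def build_vnode_table(hosts, n=N, m=M, seed=0, prev_version=0):
    """Return a versioned table mapping virtual node values to physical nodes."""
    if n * len(hosts) > m:
        raise ValueError("more virtual nodes requested than positions on the ring")
    rng = random.Random(seed)
    vnodes = {}
    for host in hosts:
        created, i = 0, 0
        while created < n:
            code = vcode(rng.randrange(m), host, i, m)
            i += 1
            if code in vnodes:
                # Assumption: on a collision, retry with the next i.
                continue
            vnodes[code] = host
            created += 1
    # The version number is bumped every time the table is rebuilt.
    return {"version": prev_version + 1, "vnodes": vnodes}


def needs_refresh(local_version: int, received_version: int) -> bool:
    """A client refreshes its cached table when its local version is lower."""
    return local_version < received_version


table = build_vnode_table(["A", "B", "C", "D", "E"])
```

With five physical nodes and N = 3 the table holds fifteen virtual nodes, and a client holding version 1 would request a new table as soon as it sees version 2.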

S114, storing and acquiring data;

Specifically, the client obtains the data hash table from the control node; this table records the address of the data node that stores each virtual node. The client computes hash(key) % M, where M is the total number of virtual nodes: it applies MurmurHash to the key of the data to generate a hash value and takes the remainder modulo M to obtain the virtual node number. It then looks up the hash table to obtain the list of data node addresses for that virtual node, and accesses the data node to store the data in the corresponding RocksDB instance. The mapping from keys to virtual nodes is fixed, but the mapping from virtual nodes to data nodes cannot be fixed, for reasons such as: a node fails and becomes unavailable; nodes are added or deleted. When this happens, the mapping from virtual nodes to data nodes, i.e. the hash table, must be rebuilt.

The total number formula of the virtual nodes is as follows:

Specifically, in distributed storage, data is distributed across multiple storage nodes, and three goals must be met: uniform distribution, i.e. the data volume on each storage node is as close as possible; load balancing, i.e. the request volume on each storage node is as close as possible; and minimal data migration during capacity expansion.

Consistent hashing with virtual nodes is therefore introduced. Consistency here means that when a node is added or deleted, only the one or two nodes adjacent to the changed node are affected and the rest of the hash table is unchanged; in other words, the hash function of consistent hashing is independent of the number of nodes.
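The client-side lookup and the consistency property can be illustrated with a short Python sketch. Several things here are assumptions made for illustration: `zlib.crc32` stands in for MurmurHash (which is not in the Python standard library), a key's position is resolved to the next virtual node clockwise on the ring, and the three-entry `vnodes` table is invented sample data.

```python
import bisect
import zlib

M = 1024  # total number of virtual nodes


def slot_of(key: str, m: int = M) -> int:
    """hash(key) % M; crc32 is a stand-in for MurmurHash (assumption)."""
    return zlib.crc32(key.encode("utf-8")) % m


def owner_of_slot(slot: int, vnodes: dict) -> str:
    """Walk clockwise to the first virtual node at or after the slot."""
    points = sorted(vnodes)
    idx = bisect.bisect_left(points, slot)
    if idx == len(points):
        idx = 0  # wrap around the ring
    return vnodes[points[idx]]


def locate(key: str, vnodes: dict, m: int = M) -> str:
    """Return the data node responsible for a key."""
    return owner_of_slot(slot_of(key, m), vnodes)


# Hypothetical table: virtual node value -> data node
vnodes = {100: "A", 400: "B", 900: "C"}
without_b = {v: n for v, n in vnodes.items() if n != "B"}

# Positions whose owner changes when data node B is removed
moved = [s for s in range(M)
         if owner_of_slot(s, vnodes) != owner_of_slot(s, without_b)]
```

In this sample, only the positions that B used to own (101 to 400) move to another node when B is removed; every other position keeps its owner, which is the behaviour the paragraph above describes.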

In addition, the SQL statements parsed by Calcite take forms such as: table (field 1, field 2, ...) and select f1, f2, ... from db.table. They are converted into code to obtain the table name, fields, source-side data and similar information, and data reads and writes are then performed through the distribution algorithm and the RocksDB API;

Apache Calcite is an open-source SQL parsing tool that parses SQL statements into an abstract syntax tree (AST). By operating on the AST, the algorithms and relations expressed in SQL can be turned into concrete code. Flink uses Calcite internally for its real-time SQL; the main work here is to add, within Flink, Calcite SQL support for the distributed dynamic table management component:

(1) adding a parser factory for the dynamic table management component to the factory classes of Calcite;

(2) parsing SQL statements and splitting the SQL into the corresponding functions of the distributed dynamic table management component;

(3) optimizing the related SQL operations, mainly the execution-planning stage of common association (join) logic:

if the number of result rows of the right node of a join is below the threshold of 1,000,000, the data of the right node is broadcast and, because there are few rows, placed in memory for calculation.

When join operations are merged, the inputs are sorted before merging and joined after merging, which improves join performance.
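The two join optimizations can be sketched in Python as below. This is an illustrative sketch only: the function names are hypothetical, rows are modelled as plain dictionaries, and the real planner decision is made inside Flink and Calcite rather than in user code.

```python
BROADCAST_THRESHOLD = 1_000_000  # row-count threshold stated in the text


def choose_join_strategy(right_row_count: int,
                         threshold: int = BROADCAST_THRESHOLD) -> str:
    """Broadcast the right side when it is small, otherwise sort then merge."""
    return "broadcast" if right_row_count < threshold else "sort_merge"


def sort_merge_join(left, right, key):
    """Inner join of two lists of dicts: sort both sides, then merge."""
    left_sorted = sorted(left, key=lambda r: r[key])
    right_sorted = sorted(right, key=lambda r: r[key])
    out, i, j = [], 0, 0
    while i < len(left_sorted) and j < len(right_sorted):
        lk, rk = left_sorted[i][key], right_sorted[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key
            j_end = j
            while j_end < len(right_sorted) and right_sorted[j_end][key] == lk:
                j_end += 1
            # Pair every left row with that key against the run
            while i < len(left_sorted) and left_sorted[i][key] == lk:
                for r in right_sorted[j:j_end]:
                    out.append({**left_sorted[i], **r})
                i += 1
            j = j_end
    return out


left = [{"k": 2, "a": "x"}, {"k": 1, "a": "y"}, {"k": 2, "a": "z"}]
right = [{"k": 2, "b": 1}, {"k": 3, "b": 2}]
joined = sort_merge_join(left, right, "k")
```

Sorting first lets the merge step scan each side once, which is why the text expects better join performance from sorting before merging.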

In one embodiment, the managing synchronous updating of distributed dynamic table metadata change information by a real-time synchronization technique further comprises the steps of:

s31, changing fields;

s32, changing the primary key and the index field;

wherein the field change comprises a field code and a field data type change.

Specifically, the metadata of the dynamic table is synchronously updated when the field codes and the data types of the fields are changed; the change of the primary key and the index field is related to the change of the key value of the stored data, and a metadata mapping relation table is adopted to uniformly manage the change of the primary key and the index field.

In addition, the synchronous updating of metadata change information is implemented mainly through change data capture (CDC): changes to the structure of the library tables are monitored and the distributed dynamic table management component is notified in real time, by means of:

1) the online parsing engine LogRuleSplit, which supports log parsers developed in Python, jar files and a DSL rule language;

2) a method for changing primary key and index fields that minimizes cost and supports rapid change. Normally, for a primary key change to take effect, the fields of the tables that reference the key as a foreign key must be changed synchronously, which requires traversing and modifying the records of the whole library table at a time cost of O(n). Here, the metadata changes of the primary key and their change times are recorded in a metadata zipper table (a history table of change records), and mapping fields are converted automatically at query time, so that the cost of a primary key change is O(1).

As shown in fig. 4 and 5, in one embodiment, the initializing the bulk import distributed dynamic table data further comprises the following steps:

S41, obtaining the metadata of the library table, including the primary key, indexes, attributes and connection information, from memory according to the library table name, and then reading the corresponding data by connecting a Reader plug-in to the data source;

S42, setting multiple field-value pairs into the hash table stored at key, using the put key field value … operation of the distributed dynamic table management component;

S43, storing the data using a partition strategy to form the table data storage format: the key is assembled as rdm_[library name]_[table name]_[primary key value], the value is the set of field values corresponding to the key, and the data is stored with put. If an index has been created, the index key rdm_[library name]_[table name]_[index field name]_[index field value] is assembled at the same time, its value being the key of the main table, rdm_[library name]_[table name]_[primary key value], and it is likewise stored with put.
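The key layout of step S43 can be sketched in Python as follows. A plain dictionary stands in for the distributed dynamic table management component, and the function names and the sample row are hypothetical.

```python
def primary_key(db: str, table: str, pk_value) -> str:
    """rdm_[library name]_[table name]_[primary key value]"""
    return f"rdm_{db}_{table}_{pk_value}"


def index_key(db: str, table: str, index_field: str, index_value) -> str:
    """rdm_[library name]_[table name]_[index field name]_[index field value]"""
    return f"rdm_{db}_{table}_{index_field}_{index_value}"


def put_row(store: dict, db: str, table: str, pk_field: str,
            row: dict, index_fields=()) -> str:
    """Store one row under its primary key, plus one entry per index field."""
    pkey = primary_key(db, table, row[pk_field])
    store[pkey] = dict(row)  # value: the field values of the row
    for field in index_fields:
        # The index entry points back to the key of the main-table row
        store[index_key(db, table, field, row[field])] = pkey
    return pkey


store = {}
pkey = put_row(store, "xxtel", "custinfo", "cust_id",
               {"cust_id": 7, "cert_no": "C1", "name": "n"},
               index_fields=["cert_no"])
```

A lookup by index field then takes two steps: read the index entry to get the main-table key, then read the row itself. Note that this simple sketch keeps only one row per index value.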

Specifically, data can be imported on demand: a specified accounting period can be selected so that only the accounting-period data that needs to participate in the calculation is imported; multiple channels can be configured to import data quickly with concurrent multithreading; and the initial import supports resumable import, re-import and data backfill.

In addition, the distributed dynamic table management component is an efficient cache with the characteristics of an in-memory database and very fast response. It supports data persistence: in-memory data can be saved to disk and reloaded on restart. It is far faster than a database, reaching 110,000 reads/s and 81,000 writes/s, and can effectively improve system performance.

In one embodiment, said checking the integrity of the imported data further comprises the steps of:

Specifically, the check is implemented by recording data volumes during initialization, mainly as follows:

1) after the multi-source table data is read, a read hook (callback) records in real time the total number of records and the time of each read and stores them in the distributed dynamic table management component, with key = rdm_[library name]_[table name]_READER_[sequence number] and value 'total number (ReadRecords) and occurrence time (op_time)';

2) after data is written into the distributed dynamic table management component, a write hook (callback) records in real time the number of successful records, the number of failed records and the time of each write, with key = rdm_[library name]_[table name]_WRITER_[sequence number] and value 'successful records (SuccessRecords), failed records (FailRecords) and occurrence time (op_time)';

3) after the import finishes, the totals are counted automatically from the keys rdm_[library name]_[table name]_READER and rdm_[library name]_[table name]_WRITER, so the completeness of the imported data can be compared immediately to decide whether the data needs to be re-imported and to guarantee that valid, complete data is imported successfully.
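The comparison in step 3) can be sketched in Python as below. The sketch assumes that the hook records are stored as dictionaries under READER and WRITER keys and that an import counts as complete when nothing failed and the number of rows written equals the number read; the function name and the sample records are hypothetical.

```python
def check_import(store: dict, db: str, table: str) -> dict:
    """Sum the per-batch hook records and compare read vs written totals."""
    reader_prefix = f"rdm_{db}_{table}_READER_"
    writer_prefix = f"rdm_{db}_{table}_WRITER_"
    read_total = sum(v["ReadRecords"] for k, v in store.items()
                     if k.startswith(reader_prefix))
    written = sum(v["SuccessRecords"] for k, v in store.items()
                  if k.startswith(writer_prefix))
    failed = sum(v["FailRecords"] for k, v in store.items()
                 if k.startswith(writer_prefix))
    return {
        "read": read_total,
        "written": written,
        "failed": failed,
        # Assumption: complete means no failures and written == read
        "complete": failed == 0 and written == read_total,
    }


# Hypothetical hook records for two read batches and two write batches
store = {
    "rdm_xxtel_custinfo_READER_1": {"ReadRecords": 500, "op_time": "t1"},
    "rdm_xxtel_custinfo_READER_2": {"ReadRecords": 300, "op_time": "t2"},
    "rdm_xxtel_custinfo_WRITER_1": {"SuccessRecords": 500, "FailRecords": 0, "op_time": "t3"},
    "rdm_xxtel_custinfo_WRITER_2": {"SuccessRecords": 290, "FailRecords": 10, "op_time": "t4"},
}
report = check_import(store, "xxtel", "custinfo")
```

Here 800 rows were read but only 790 written, so the report flags the import as incomplete and a re-import would be needed.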

As shown in fig. 6, in an embodiment, the updating the data in the distributed dynamic table in real time further includes the following steps:

S61, connecting to the database and obtaining the position that was last parsed successfully (on first start, the initially specified position or the current log position of the database is used);

S64, the real-time data parser receives the log data, calls the log parser corresponding to the log type to parse the protocol, and supplements related information (field names, field types, primary key information, index information and handling of signed types); the parsed data is sent to the corresponding real-time data receiver according to the hash value of the library table and primary key;

S65, the real-time data receiver receives and stores the data (data of the same table is routed to the same RealDatasink module instance to preserve the order of the data);

S66, data integrity is guaranteed by the transaction mechanism of the real-time data receiver (RealDatasink performs the data storage as a blocking operation that lasts until the data is stored successfully);

Specifically, during real-time synchronous updating, several transaction mechanisms are provided to ensure the correctness of the data participating in real-time calculation:

1) Read uncommitted: if one transaction has started writing a row, other transactions may not write to it at the same time but may still read it. This isolation level can be implemented with an exclusive write lock that does not exclude reader threads. It avoids lost updates, but dirty reads may occur, i.e. transaction B may read uncommitted data from transaction A: writes stay consistent, but the data read may not be the latest.

2) Repeatable read: the same data is read multiple times within one transaction, and until that transaction finishes, other transactions cannot modify the data, so two reads in the same transaction return the same result; hence the name repeatable read. A transaction that reads data blocks write transactions (but allows read transactions), and a write transaction blocks all other transactions (reads and writes), which avoids non-repeatable reads and dirty reads. To guarantee the correctness of data participating in calculation, when data changes the row is locked by the value of its primary key field so that it can no longer be read or written; it can be read and written again once the operation finishes.
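Locking a row by its primary key value can be sketched in Python with one lock per key. This is a simplified, single-process illustration with hypothetical names; it models only the exclusive row lock (reads and writes of the same row wait for each other) and not a full transaction manager.

```python
import threading
from collections import defaultdict


class RowLockedTable:
    """In-memory table where each primary key value has its own lock."""

    def __init__(self):
        self._rows = {}
        self._locks = defaultdict(threading.Lock)
        self._guard = threading.Lock()  # protects the lock dictionary itself

    def _lock_for(self, pk):
        with self._guard:
            return self._locks[pk]

    def update(self, pk, fn):
        """Apply fn to the current value while holding the row lock."""
        with self._lock_for(pk):
            self._rows[pk] = fn(self._rows.get(pk))

    def read(self, pk):
        """Reads wait until any in-flight update of the same row finishes."""
        with self._lock_for(pk):
            return self._rows.get(pk)


def increment(value):
    return (value or 0) + 1


def worker(table, pk, times):
    for _ in range(times):
        table.update(pk, increment)


table = RowLockedTable()
threads = [threading.Thread(target=worker, args=(table, "pk1", 500))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every read-modify-write of a row happens under that row's lock, the eight writers do not lose updates, while rows with other primary keys remain freely accessible.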

In one embodiment, the synchronous monitoring of the real-time data of the data synchronous updating process further includes the following steps:

Specifically, during real-time synchronization, records may fail to synchronize because of network or other problems, so a real-time data monitoring capability is provided.

To guarantee effective synchronization, the effect of data synchronization is monitored in real time, mainly as follows:

1) accumulating in real time the number of records synchronized and the elapsed time;

2) recording the number and details of abnormal or dirty records;

3) recording abnormal logs during synchronization and, by classifying them, feeding back abnormal conditions in real time; because the log format is quite fixed, the data in the abnormal logs is parsed and extracted with rules and regular expressions.

As shown in fig. 7, the dynamic table is applied in stream computing, and the Flink-based SQL mechanism can read the data of the distributed dynamic table. In an embodiment, converting the data in the distributed dynamic table into a virtual table of real-time stream data through the structured query language further includes the following steps:

S81, writing the structured query language that reads the distributed dynamic table; the SQL may include joins and similar operations. Given the characteristics of the service scenario, the query conditions on a dynamic table participating in real-time calculation must be on primary key or index fields, and both fuzzy and exact queries on the primary key or index are supported;

S83, splitting the structured query language into operators that can be coded, identifying the distributed dynamic tables and related field information, parsing fields and values, and querying according to the table data storage format. That is, the SQL is parsed and split into codable operators; dynamic tables (prefixed with rdm_) and their fields are identified; fields and values are parsed from the where condition; and a series of keys or fuzzy key patterns is assembled according to the previously stored data format. The data is then fetched either with hmget(key, f1, f2) or in fuzzy mode with SCAN cursor [MATCH pattern] [COUNT count], where pattern expresses the fuzzy-match rule for the key;
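How a where condition might be turned into an exact key or a fuzzy key pattern can be sketched in Python as follows. The sketch uses a dictionary as the store and `fnmatch` for pattern matching; `hmget` and `scan` here are local stand-ins that imitate the multi-field get and pattern scan named in the text, and the trailing `*` wildcard is an assumption about the pattern syntax.

```python
import fnmatch


def exact_key(db: str, table: str, pk_value) -> str:
    """Key for an exact primary-key match."""
    return f"rdm_{db}_{table}_{pk_value}"


def fuzzy_pattern(db: str, table: str, pk_prefix: str) -> str:
    """Pattern for a fuzzy match on the primary key (assumed '*' wildcard)."""
    return f"rdm_{db}_{table}_{pk_prefix}*"


def hmget(store: dict, key: str, *fields):
    """Stand-in for a multi-field get: return the requested field values."""
    row = store.get(key, {})
    return [row.get(f) for f in fields]


def scan(store: dict, pattern: str):
    """Stand-in for a pattern scan: return all keys matching the pattern."""
    return sorted(k for k in store if fnmatch.fnmatchcase(k, pattern))


store = {
    "rdm_xxtel_custinfo_1001": {"name": "a", "price": 1},
    "rdm_xxtel_custinfo_1002": {"name": "b", "price": 2},
    "rdm_xxtel_userinfo_1001": {"name": "u", "price": 0},
}

one_row = hmget(store, exact_key("xxtel", "custinfo", 1001), "name", "price")
matches = scan(store, fuzzy_pattern("xxtel", "custinfo", "10"))
```

An exact condition on the primary key becomes a single key lookup, while a fuzzy condition becomes a pattern that is scanned and then fetched key by key.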

Data is loaded on demand from the dynamic table into an in-memory virtual table, which stores the data using a Flink window or memory as needed. The virtual table is created automatically by the system and has the same name as the distributed dynamic table. When virtual table data is generated in table format, the transaction mechanism described above is used to ensure that no dirty data is read during updates. In this way the data of the distributed dynamic table is converted into stream data; the method for loading the data is described in detail in step S5.

Calcite's SQL optimization capability is used to optimize the assembled SQL. The optimizer's main goals are to reduce the volume of data processed by the SQL, reduce resource consumption and maximize SQL execution efficiency, for example by pruning unused columns, merging projections, converting subqueries into joins, reordering joins, pushing down projections and pushing down filters. There are currently two main types of optimization: rule-based optimization (RBO) and cost-based optimization (CBO).

RBO (Rule-Based Optimization) means, in plain terms, that a series of rules is defined in advance and the execution plan is then optimized according to them. For example, a RelNode tree:

LogicalFilter
    LogicalProject
        LogicalTableScan

can be optimized to:

LogicalProject
    LogicalFilter
        LogicalTableScan

FilterJoinRule applies when a Filter sits above a Join: the filter can be applied first and the join performed afterwards, which reduces the number of rows joined. There are many similar rules.
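The rule-based rewrite of the RelNode tree shown above can be imitated in a few lines of Python. This toy sketch only swaps operator positions on a single-child tree; it is not Calcite's API, the `Node` class is hypothetical, and a real rule would also check that the filter condition only uses columns available below the projection.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    op: str
    child: Optional["Node"] = None


def push_filter_below_project(node: Optional[Node]) -> Optional[Node]:
    """Rewrite Filter(Project(x)) into Project(Filter(x)) wherever it occurs."""
    if node is None:
        return None
    if (node.op == "LogicalFilter" and node.child is not None
            and node.child.op == "LogicalProject"):
        project = node.child
        pushed = Node("LogicalFilter", project.child)
        return Node("LogicalProject", push_filter_below_project(pushed))
    return Node(node.op, push_filter_below_project(node.child))


def shape(node: Optional[Node]):
    """List the operators from root to leaf."""
    ops = []
    while node is not None:
        ops.append(node.op)
        node = node.child
    return ops


plan = Node("LogicalFilter", Node("LogicalProject", Node("LogicalTableScan")))
optimized = push_filter_below_project(plan)
```

After the rewrite the filter runs directly on the scan output, so fewer rows reach the projection, which is the effect the rule aims for.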

CBO (Cost-Based Optimization) uses an algorithm to calculate the cost of all possible execution plans of the SQL and selects a plan with lower cost. Generally speaking, RBO cannot judge which execution plan is better, whereas CBO calculates the cost of each join method and selects the best one. The real-time computing engine of Flink then executes the optimized SQL and generates the real-time data result.

Specifically, the data of the distributed dynamic table can be queried continuously through SQL statements, loaded on demand and converted into stream data. The SQL is written as follows: select r1.name, r1.price from rdm_[library name].[distributed dynamic table name] r1 where [condition on index field or primary key field];

wherein:

rdm_: indicates that a distributed dynamic table is being read;

[library name]: identifies the source library from which the data is synchronized;

[distributed dynamic table name]: the table name created with create table, generally required to match the table name at the source;

[where]: filters on the primary key and index fields of the table, mainly to locate specific records;

The data of the dynamic table is loaded through SQL and converted into a virtual table of real-time stream data. The virtual table is created automatically by the system with the same name as the dynamic table, and data consistency is controlled through transactions while the data is read.

As shown in fig. 8 and 9, in an embodiment, the combining the data in the virtual table with the preconfigured stream data further includes:

S91, reading the data of the virtual table and the flow table using the structured query language of the real-time computing engine, i.e. Flink SQL. For example, the following SQL joins the virtual table (rdm_db.dim1) with the flow table (flow.t1), and the data is computed using Flink's operators:

select a.pkey, a.fkey, a.udata, b.udatetype from flow.t1 a, rdm_db.dim1 b where a.fkey = b.fkey;

S92, loading the data of the virtual table on demand: the conditions for loading the virtual table data are parsed, and the data is loaded from the distributed dynamic table into the memory of the real-time computing engine in an asynchronous, multithreaded manner according to those conditions;

S93, during virtual table data loading, transmitting the data in an asynchronous, multi-concurrent manner, with the number of concurrent transfers allocated dynamically according to the data volume, which improves transfer performance;

S94, during loading, grouping by table and splitting the work into a suitable number of concurrent readers according to the data read range and the idle state of cluster resources, then reading the data concurrently so that the data of the previous step can be transmitted quickly in real time.

Specifically, to realize the combined calculation of stream data and the virtual table, various stream computing applications are developed using Flink SQL or the Flink API. Flink is well suited to low-latency data processing and offers high concurrency and reliability. The calculation of stream data with the virtual table is realized through the five steps shown in fig. 8, meeting the second-level real-time requirement for complex service logic that combines external sources with hundred-million-row tables. The virtual table created in the previous step is read with Flink SQL, and its data is joined, aggregated, compared and so on with the stream data. The calculation proceeds as follows:

1) using the existing capability of Flink SQL, the data of the virtual table can be read directly; for example, as shown in fig. 8, the three SQL statements listed in step 1 represent join calculation, aggregation calculation and value lookup respectively;

2) likewise using the existing capabilities of Flink, various calculations are performed on the data of the virtual table (virtual data) and the data of the flow table (flow data), such as join, aggregate and single-value lookup (GetOneRecord). The virtual table data must have arrived for this; its arrival is ensured by the following steps. The SQL execution plan is optimized with the Calcite optimization capability in Flink described above, which accelerates the calculation;

3) loading the virtual table data on demand: the SQL is parsed with Calcite as described above and the virtual table part is split out. Because the virtual table is queried mainly by fuzzy or exact match on the primary key or index (see step S4), this part is converted into API calls such as hmget(key, f1, f2). The number of records to be fetched is estimated in advance according to the form of access (a single record, multiple records of fixed volume, records of a join query, and so on), and the data-fetch statement is split into several small subtasks so that they can be executed concurrently.

The number of records is estimated in advance as follows:

For a single record or multiple records of fixed volume:

records = count(split(dataset, separator));

Note:

the separator defaults to ',';

the split method splits the data set by the specified separator, and count then gives the number of records;

For unknown data volumes such as the records of a join query:

records = N × (1 + 10%) + log2(N), where N is the historical record count or the record count of a similar table;

Note: a 10% error margin is added to the historical record count or the record count of a similar table; if neither is available, a default of 50,000 records is set from experience; log2(N) is added as a data expansion allowance;
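The two estimation formulas can be written out in Python as a quick check. The function names are hypothetical, rounding up to a whole record is an added assumption, and applying the formula to the 50,000 default when no history exists is one reading of the text.

```python
import math

DEFAULT_RECORDS = 50_000  # empirical default stated in the text


def count_fixed(dataset: str, separator: str = ",") -> int:
    """records = count(split(dataset, separator)) for known, fixed inputs."""
    return len(dataset.split(separator)) if dataset else 0


def estimate_unknown(history_count=None) -> int:
    """records = N * (1 + 10%) + log2(N) for join-style queries.

    N is the historical (or similar-table) record count; when absent,
    the default is used as N (assumption). The result is rounded up.
    """
    n = history_count if history_count else DEFAULT_RECORDS
    return math.ceil(n * 1.10 + math.log2(n))


fixed = count_fixed("1001,1002,1003")
estimated = estimate_unknown(1024)
```

For N = 1024 the formula gives 1024 × 1.1 + 10 = 1136.4, which the sketch rounds up to 1137 records.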

4) the transfer mode for step 3) is determined as follows: during virtual table data loading, the data of each table is transmitted in an asynchronous, multi-concurrent manner, and the number of concurrent transfers is allocated dynamically according to the data volume of the table, which improves transfer performance;

The number of concurrent transfers is allocated dynamically on the following basis:

concurrency per table = max(records / 100 / response duration, average CPU core count) × concurrency yield rate;

where records is the number of records; response duration is the data loading time, usually within the second level; the average CPU core count (generally 8) represents the minimum hardware resources available to the system, used to guarantee performance; and the concurrency yield rate is the proportion of hardware resources that can be used, defaulting to the empirical value 0.6;
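The concurrency formula can be evaluated with a small Python helper. This sketch reads the formula as max(records / 100 / response duration, average CPU core count) multiplied by the concurrency yield rate; the function name, the rounding up to a whole number of channels and the floor of one channel are assumptions.

```python
import math


def transfer_concurrency(records: int,
                         response_seconds: float = 1.0,
                         avg_cpu_cores: int = 8,
                         yield_rate: float = 0.6) -> int:
    """Concurrency per table for data transfer.

    max(records / 100 / response duration, average CPU cores) * yield rate,
    rounded up to a whole number of channels (assumption).
    """
    if response_seconds <= 0:
        raise ValueError("response duration must be positive")
    raw = max(records / 100 / response_seconds, avg_cpu_cores) * yield_rate
    return max(1, math.ceil(raw))


small_table = transfer_concurrency(100)    # the core count dominates
large_table = transfer_concurrency(5000)   # the record volume dominates
```

For a small table the core count term wins (8 × 0.6, rounded up to 5 channels), while a 5,000-record table with a one-second target gets 50 × 0.6 = 30 channels.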

5) the fetch concurrency is determined as follows: during loading, the work is grouped by table and split into a suitable number of concurrent readers according to the data read range and the idle state of cluster resources, and the data is read concurrently so that the data of the previous step can be transmitted quickly in real time;

number of groups = count(tables);

To ensure that the data read can be transmitted in time, read and transfer channels are paired one to one, i.e. fetch concurrency per table = transfer concurrency per table. Each Reader task is started by its Group; once started, each Reader runs a fixed chain of threads (Reader, data transfer, virtual table) to complete the synchronization work.

Specifically, the output of step S10 is illustrated by the following example:

For example, a telecommunications company normalizes its natural-person customer data: data from a certain electronic channel and data from the company's own systems is collected in real time, compared, and updated according to the following rules:

When the certificate type and certificate number of a customer match but the customer names differ, the update rules are:

Level 1 rule (order source): an order from a certain type of electronic channel takes priority over the stock system, and the customer name is updated;

Level 2 rule: the record with the longer certificate validity period takes priority, and the customer name is updated;

Level 3 rule: whether the customer has users is checked, and the customer name is updated accordingly;

To meet these requirements, the implementation steps are:

1) creating the customer information table and the user table as distributed dynamic tables;

2) initializing and loading the data of the customer information table (hundred-million level) and the user table (hundred-million level) of the distributed dynamic tables;

3) configuring the real-time data synchronization tasks between the source tables and the distributed dynamic tables;

4) developing the real-time service logic in SQL:

select customer name, certificate validity period, order source from rdm_xxtel.custinfo (customer information table in the dynamic table) where cert_no (certificate number) = [value] and cert_type (certificate type) = [value];

select count(1) from rdm_xxtel.userinfo (user information table in the dynamic table) where list_id (customer primary key) = [value];

5) accessing the stream data and evaluating the electronic channel data jointly with the results of 4):

if the order source of the electronic channel data is a certain type of electronic channel order, update the customer name in the stock system;

if the certificate validity period in the electronic channel data is greater than the certificate validity period in the customer information table of the dynamic table, update the customer name in the stock system;

if the number of users in the user information table of the dynamic table is less than 0, update the customer name in the stock system;

Note: in this case the data in the dynamic tables is used several times in real-time calculation, as in steps 4) and 5), and the business logic must be evaluated against the customer information table and the user information table of the stock system, both hundred-million-row tables.

As shown in fig. 9, the performance is as follows: reading and writing the distributed dynamic table reaches second-level latency, fully meeting the real-time calculation requirement.

To aid understanding of the technical solutions of the present invention, the working principle and operation of the invention in practice are described in detail below.

In practical application, real-time field completion meets the real-time computing requirement: during real-time data migration, the business logic transmits only the modified fields, so most fields must be completed by joining with external data sources, such as the big data components HBase and ClickHouse, the relational databases MySQL and Oracle, and unstructured databases; data acquisition from these sources meets the real-time performance requirement.

Complex filter conditions meet the real-time computing requirement: during real-time calculation, the real-time stream must be filtered jointly with existing data. External data must still be accessed to decide whether the stream data satisfies the filter conditions, and problems such as the read performance of external data sources and limits on connections are solved.

Complex business logic conversion meets the real-time computing requirement: during real-time calculation, data must be converted from one state to another by business logic. For example, external data must participate deeply in aggregation, state value conversion and similar calculations, which places very high demands on the read-write performance of the external data source; the requirement that external data support second-level real-time calculation is met.

Multiple external data sources are supported with enhanced real-time calculation, meeting the second-level real-time computing requirement: external data sources include big data components such as HBase, Hive, HDFS, Kudu and ClickHouse, relational databases such as MySQL, Oracle and PostgreSQL, and unstructured databases such as ES.

In summary, by means of the above technical solutions of the present invention, the combination of the current excellent open source technologies is utilized, the open source components are optimized and modified, and in a proper link, the technology is skillfully utilized to load large table data step by step and layer by layer, load large table data as required, effectively disassemble the large table into small tables as required, and simultaneously, the high efficiency characteristics of the memory are fully and effectively utilized to provide large table storage data components, standardize the structure of the multi-source table data, accelerate the performance of real-time computation, and effectively solve the problem of painful computation performance of real-time services on the large inventory tables; the data of the source end is synchronized by using a batch and push protocol mode, the access pressure to a source end base table is avoided, the service scene requirement of the combination calculation of most real-time streams and offline data can be met, the service range supported by real-time calculation is widened, and the real-time calculation requirement of complex service logic under the condition that the real-time calculation is combined with an external data source over hundred million tables is well met. 
Based on Flink, the current mainstream real-time computing engine, the method still meets real-time performance requirements when stream data is calculated in combination with large and small table data from multi-source external data sources, and provides the capacity for complex service processing of highly concurrent, high-throughput real-time data. It mainly addresses the second-level performance requirement when stream calculation is combined with large tables of external data sources, chiefly big data components such as HBase, Hive and HDFS, relational databases such as MySQL and Oracle, and unstructured databases such as ES. In addition, when stream calculation is combined with external source data, the incremental data of the external source is synchronized in real time and takes part in the combined calculation, thereby ensuring performance and correctness during complex real-time stream service calculation.
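
The interplay of batch initialization, integrity checking, pushed incremental updates and combined calculation described above can be sketched in plain Python as follows. This is a simplified, single-process illustration and does not use Flink; the class `DynamicTable`, the function `join_event`, the row-count check and the field names are assumptions made for the example.

```python
class DynamicTable:
    """In-memory copy of one external table, keyed by its primary key."""

    def __init__(self, key_field):
        self.key_field = key_field
        self.rows = {}

    def bulk_load(self, rows, expected_count):
        """Initial batch import followed by a simple row-count integrity check."""
        for row in rows:
            self.rows[row[self.key_field]] = dict(row)
        if len(self.rows) != expected_count:
            raise ValueError(
                f"expected {expected_count} rows, loaded {len(self.rows)}"
            )

    def apply_change(self, op, row):
        """Apply one incremental change pushed from the source end."""
        key = row[self.key_field]
        if op in ("insert", "update"):
            self.rows[key] = dict(row)
        elif op == "delete":
            self.rows.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")


def join_event(table, event):
    """Join one stream event with the current table row, or return None."""
    row = table.rows.get(event[table.key_field])
    return None if row is None else {**row, **event}


table = DynamicTable("account_id")
table.bulk_load(
    [{"account_id": 1, "plan": "basic"}, {"account_id": 2, "plan": "pro"}],
    expected_count=2,
)

before = join_event(table, {"account_id": 1, "bytes": 100})
table.apply_change("update", {"account_id": 1, "plan": "pro"})
after = join_event(table, {"account_id": 1, "bytes": 250})
table.apply_change("delete", {"account_id": 2})
missing = join_event(table, {"account_id": 2, "bytes": 10})
```

The stream events only ever read the in-memory copy, so the source-end base table is accessed once for the batch load and afterwards only pushes its changes.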

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. An enhanced real-time computing method for dynamically synchronizing multi-source large table data in real time is characterized by comprising the following steps of:

S4, initializing batch import distributed dynamic table data;

S5, checking the integrity of the imported data;

S6, updating the data in the distributed dynamic table in real time;

and S10, outputting the calculation result.

2. The enhanced real-time computing method for dynamically synchronizing multi-source large table data in real time according to claim 1, wherein the building of the distributed dynamic table management component through the single-node storage engine based on the virtual node consistent hash algorithm and the structured query language further comprises the following steps:

3. The method of claim 2, wherein the step of implementing the uniform distribution storage of data based on the virtual node consistency hash algorithm further comprises the steps of:

S111, abstracting the whole hash space into a virtual ring;

S114, storing and acquiring data;

the total number formula of the virtual nodes is as follows:

4. The method of claim 1, wherein managing the synchronous update of the metadata change information of the distributed dynamic table through a real-time synchronization technique further comprises:

S31, changing fields;

S32, changing the primary key and the index field;

wherein the field change comprises a field code and a field data type change.

5. The method of claim 1, wherein initializing batch import distributed dynamic table data further comprises:

6. The method of claim 1, wherein the checking the integrity of the imported data further comprises the steps of:

7. The method of claim 1, wherein the real-time updating of the data in the distributed dynamic table further comprises:

S65, the real-time data receiver receives the data and stores the data;

8. The method of claim 1, wherein the step of synchronously monitoring real-time data during the data synchronous update process further comprises the steps of:

9. The method of claim 1, wherein the step of converting the data in the distributed dynamic table into the virtual table of the real-time stream data through the structured query language further comprises the following steps:

10. The method of claim 1, wherein the performing the combined stream calculation on the data in the virtual table and the pre-configured stream data further comprises: