CN116414902A - Quick data source access method

Info

Publication number: CN116414902A
Application number: CN202310342953.5A
Authority: CN
Inventors: 杨铭; 戚红建; 韩硕; 王宇飞; 李伟; 刘誉杰; 邓旭楠; 唐鑫湄; 陈璐; 张明涛
Original assignee: Beijing Bidding Branch Of China Huaneng Group Co ltd; Huaneng Information Technology Co Ltd
Current assignee: Beijing Bidding Branch Of China Huaneng Group Co ltd; Huaneng Information Technology Co Ltd
Priority date: 2023-03-31
Filing date: 2023-03-31
Publication date: 2023-07-11

Abstract

The invention discloses a rapid data source access method, which relates to the technical field of data access. The method comprises: dividing data tables into types according to data attributes in a source database, and exporting each data table to be accessed as a file in a preset format; importing the stock data in the preset format file into a transfer database, and importing the data in the transfer database into a target database; synchronously copying the incremental data in the source database into a cache database at a preset synchronization rate while adding an add/delete/modify identifier and a timestamp field, analyzing and converting the incremental data in the cache database, and accessing the incremental data into the target database at a preset access frequency; while the source database synchronizes data to the cache database, updating the synchronization rate in real time according to synchronization information; and while the data in the cache database is analyzed and accessed into the target database, updating the access frequency in real time according to analysis access information. Because the synchronization rate and the access frequency are updated in real time, the data access speed is improved and the security and stability of the data are ensured.

Description

Quick data source access method

Technical Field

The present application relates to the field of data access technologies, and in particular, to a fast data source access method.

Background

The structured business data of an existing enterprise includes work order data, archive data, operation management data, telephone traffic data, marketing data, common knowledge data and the like. Data at such a massive scale is found in only a handful of existing data centers at home and abroad, so integrating it is highly challenging and calls for new approaches. When data is accessed from a source system and stored in a data warehouse, its correctness and integrity must be ensured, and the data must still meet the usage requirements of every service once it enters the new environment. Furthermore, from the perspective of storage optimization, data should not be stored more than once except where redundant backup is required.

In the prior art, because the data volume is large and the data types are numerous, the data access conditions are complex; and because fixed transmission information is used, the database cannot meet the varying requirements, so the access speed is low.

Therefore, how to increase the data access speed is a technical problem to be solved at present.

Disclosure of Invention

The invention provides a rapid data source access method for solving the technical problem of slow data access speed in the prior art. The method is applied to a system comprising a source database, a transfer database, a cache database and a target database, and comprises the following steps:

in a source database, dividing the data table type according to the data attribute, exporting the data table to be accessed into a preset format file, and simultaneously recording the stock data range and distinguishing the stock data and the increment data;

importing stock data of a file in a preset format into a transfer database, importing data in the transfer database into a target database, and deleting the data in the transfer database after the data are successfully imported;

synchronously copying the incremental data in the source database into a cache database at a preset synchronization rate, adding an add/delete/modify identifier and a timestamp field, analyzing and converting the incremental data in the cache database, and accessing the incremental data into a target database at a preset access frequency;

while the source database synchronizes data to the cache database, acquiring synchronization information and updating the synchronization rate in real time according to the synchronization information; and while the data in the cache database is analyzed and accessed into the target database, acquiring analysis access information and updating the access frequency in real time according to the analysis access information.

In some embodiments of the present application, after the synchronization information is acquired, the method further includes:

the synchronization information comprises average synchronization time of each piece of data;

if the average synchronization time of each piece of data is larger than a first preset time threshold value, the synchronization rate is updated in real time according to the synchronization information;

if the average synchronization time of each piece of data is not greater than the first preset time threshold, the synchronization rate is not updated.

In some embodiments of the present application, after the analysis access information is acquired, the method further includes:

the analysis access information comprises the analysis completion time of all the data;

if the analysis completion time of all the data is greater than a second preset time threshold, the access frequency is updated in real time according to the analysis access information;

if the analysis completion time of all the data is not greater than the second preset time threshold, the access frequency is not updated.
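The two threshold checks above can be sketched as follows. This is only an illustrative sketch: the class name `Thresholds`, the function names and the numeric threshold values are assumptions, since the text only calls the thresholds "preset".

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    # Hypothetical values; the source only says the thresholds are preset.
    first_time_threshold_s: float = 0.05    # average synchronization time per piece of data
    second_time_threshold_s: float = 600.0  # analysis completion time of all the data


def should_update_sync_rate(avg_sync_time_s: float, t: Thresholds) -> bool:
    """True only when the average synchronization time per piece of data
    is strictly greater than the first preset time threshold."""
    return avg_sync_time_s > t.first_time_threshold_s


def should_update_access_frequency(total_analysis_time_s: float, t: Thresholds) -> bool:
    """True only when the analysis completion time of all the data
    is strictly greater than the second preset time threshold."""
    return total_analysis_time_s > t.second_time_threshold_s
```

A value exactly equal to a threshold leaves the corresponding parameter unchanged, which matches the "not greater than" wording of the text.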

In some embodiments of the present application, updating the synchronization rate in real time according to the synchronization information includes:

the synchronization information further comprises source end and target end server performance information, add/delete/modify time and source end database inflow data, wherein the source end server performance information comprises source end average CPU utilization rate and source end memory consumption, and the target end server performance information comprises target end average CPU utilization rate and target end memory consumption;

establishing a first source end correction array according to the average CPU utilization rate of the source end, the memory consumption of the source end, the add/delete/modify time and the inflow data of the source end database, and obtaining a first influence value according to the first source end correction array;

establishing a first target end correction array according to the average CPU utilization rate of the target end and the memory consumption of the target end, and obtaining a second influence value according to the first target end correction array;

determining a synchronization process influence value based on the first influence value and the second influence value, and updating the synchronization rate based on the synchronization process influence value and the data table type.

In some embodiments of the present application, establishing a first source end correction array according to the source end average CPU utilization rate, the source end memory consumption, the add/delete/modify time and the source end database inflow data, and obtaining a first influence value according to the first source end correction array, includes:

obtaining a plurality of influence scores based on the average CPU utilization rate of the source end, the memory consumption of the source end, the add/delete/modify time, the inflow data of the source end database and a first preset weight;

determining the position sequence of a first source end correction array based on the magnitude relation among the influence scores, and constructing the first source end correction array according to the position sequence;

and obtaining a first influence value based on the local factor corresponding to the position sequence in the first source end correction array and the first source end correction array.

In some embodiments of the present application, establishing a first target end correction array according to the target end average CPU utilization rate and the target end memory consumption, and obtaining a second influence value according to the first target end correction array, includes:

obtaining a plurality of influence scores based on the average CPU utilization rate of the target end, the memory consumption of the target end and a second preset weight;

determining the position sequence of a first target end correction array based on the magnitude relation among the influence scores, and constructing the first target end correction array according to the position sequence;

and obtaining a second influence value based on the local factor corresponding to the position sequence in the first target end correction array and the first target end correction array.

In some embodiments of the present application, updating the synchronization rate based on the synchronization process impact value and the data table type includes:

determining the two endpoint values of the synchronization process influence according to the data table type, and selecting a plurality of preset synchronization process influence values based on the two endpoint values;

and determining an update coefficient based on the relation between the synchronization process influence value and the plurality of preset synchronization process influence values, and updating the preset synchronization rate based on the update coefficient.

In some embodiments of the present application, updating the access frequency in real time according to the analysis access information includes:

the analysis access information also comprises source end and target end server performance information and source end database inflow data, the source end server performance information comprises source end average CPU utilization rate and source end memory consumption, and the target end server performance information comprises target end average CPU utilization rate and target end memory consumption;

establishing a second source correction array according to the average CPU utilization rate of the source, the memory consumption of the source and the inflow data of the source database and obtaining a third influence value;

establishing a second target end correction array according to the average CPU utilization rate of the target end and the memory consumption of the target end and obtaining a fourth influence value;

and determining an analysis access process influence value based on the third influence value and the fourth influence value, and updating the access frequency based on the analysis access process influence value and the data table type.

In some embodiments of the present application, establishing a second source end correction array according to the source end average CPU utilization rate, the source end memory consumption and the source end database inflow data and obtaining a third influence value, and establishing a second target end correction array according to the target end average CPU utilization rate and the target end memory consumption and obtaining a fourth influence value, includes:

obtaining a plurality of influence scores based on the average CPU utilization rate of the source end, the memory consumption of the source end, the inflow data of the source end database and a third preset weight;

determining the position sequence of a second source end correction array based on the magnitude relation among the influence scores, and constructing the second source end correction array according to the position sequence;

obtaining a third influence value based on the local factor corresponding to the position sequence in the second source end correction array and the second source end correction array;

obtaining a plurality of influence scores based on the average CPU utilization rate of the target end, the memory consumption of the target end and a fourth preset weight;

determining the position sequence of a second target end correction array based on the magnitude relation among the influence scores, and constructing the second target end correction array according to the position sequence;

and obtaining a fourth influence value based on the local factor corresponding to the position sequence in the second target end correction array and the second target end correction array.

In some embodiments of the present application, updating the access frequency based on the analysis access process influence value and the data table type includes:

determining the two endpoint values of the analysis access process influence according to the data table type, and selecting a plurality of preset analysis access process influence values based on the two endpoint values;

and determining an update coefficient based on the relation between the analysis access process influence value and the plurality of preset analysis access process influence values, and updating the preset access frequency based on the update coefficient.

By applying the above technical scheme, in the source database, the data tables are divided into types according to the data attributes, the data table to be accessed is exported as a preset format file, and the stock data range is recorded at the same time to distinguish the stock data from the incremental data; the stock data in the preset format file is imported into a transfer database, the data in the transfer database is imported into a target database, and the data in the transfer database is deleted after the import succeeds; the incremental data in the source database is synchronously copied into a cache database at a preset synchronization rate, an add/delete/modify identifier and a timestamp field are added, the incremental data in the cache database is analyzed and converted, and the incremental data is accessed into the target database at a preset access frequency; while the source database synchronizes data to the cache database, synchronization information is acquired and the synchronization rate is updated in real time according to the synchronization information; and while the data in the cache database is analyzed and accessed into the target database, analysis access information is acquired and the access frequency is updated in real time according to the analysis access information. According to the present application, when incremental data is accessed, whether the access parameters need to be updated is judged against preset criteria, and the synchronization rate and the access frequency are updated in real time, so that the data access speed is improved and the security and stability of the data are ensured.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 shows a flow diagram of a fast data source access method according to an embodiment of the present invention;

Fig. 2 shows a schematic diagram of stock data access according to another embodiment of the present invention;

Fig. 3 shows a schematic diagram of incremental data access according to another embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.

The embodiment of the application provides a rapid data source access method, which is applied to a system comprising a source database, a transfer database, a cache database and a target database, as shown in fig. 1, and comprises the following steps:

Step S101, in the source database, dividing the data tables into types according to the data attributes, exporting the data table to be accessed as a preset format file, and simultaneously recording the stock data range to distinguish the stock data from the incremental data.

In this embodiment, the data attributes are of three kinds, namely simple growth, full update and short-term archiving, and there are three corresponding data table types.

Step S102, importing stock data of the file in the preset format into a transfer database, importing data in the transfer database into a target database, and deleting the data in the transfer database after the importing is successful.

Step S103, synchronously copying the incremental data in the source database into a cache database at a preset synchronization rate, adding an add/delete/modify identifier and a timestamp field, analyzing and converting the incremental data in the cache database, and accessing the incremental data into the target database at a preset access frequency.

In this embodiment, the preset synchronization rate and the preset access frequency are initially fixed values, and they are updated according to the subsequent judgments.

Step S104, while the source database synchronizes data to the cache database, synchronization information is acquired and the synchronization rate is updated in real time according to the synchronization information; while the data in the cache database is analyzed and accessed into the target database, analysis access information is acquired and the access frequency is updated in real time according to the analysis access information.

In this embodiment, the synchronization rate and the access frequency are updated in real time, so as to ensure rapid access of data.

The analysis access information comprises the analysis completion time of all the data.

To increase the data access rate, in some embodiments of the present application, updating the synchronization rate in real time according to the synchronization information includes the following. The synchronization information further comprises source end and target end server performance information, add/delete/modify time and source end database inflow data, wherein the source end server performance information comprises source end average CPU utilization rate and source end memory consumption, and the target end server performance information comprises target end average CPU utilization rate and target end memory consumption. A first source end correction array is established according to the average CPU utilization rate of the source end, the memory consumption of the source end, the add/delete/modify time and the inflow data of the source end database, and a first influence value is obtained according to the first source end correction array. A first target end correction array is established according to the average CPU utilization rate of the target end and the memory consumption of the target end, and a second influence value is obtained according to the first target end correction array. A synchronization process influence value is determined based on the first influence value and the second influence value, and the synchronization rate is updated based on the synchronization process influence value and the data table type.

In this embodiment, synchronization information is acquired while data is synchronized from the source database to the cache database. The synchronization information consists of the factors that influence the synchronization speed, including source end and target end server performance information, add/delete/modify time, source end database inflow data and the like. These factors are divided into two parts, namely the source end influence and the target end influence. The source end influence comprises the source end server performance information, the add/delete/modify time and the source end database inflow data. The target end influence comprises the target end server performance information. A first source end correction array is established according to the source end influence and a first target end correction array is established according to the target end influence, so as to obtain a first influence value and a second influence value. A synchronization process influence value (the total influence) is then determined based on the first influence value and the second influence value, and the synchronization rate is updated based on the synchronization process influence value and the data table type.

In order to further improve data synchronization efficiency, in some embodiments of the present application, establishing a first source end correction array according to the source end average CPU utilization rate, the source end memory consumption, the add/delete/modify time and the source end database inflow data, and obtaining a first influence value according to the first source end correction array, includes: obtaining a plurality of influence scores based on the average CPU utilization rate of the source end, the memory consumption of the source end, the add/delete/modify time, the inflow data of the source end database and a first preset weight; determining the position sequence of a first source end correction array based on the magnitude relation among the influence scores, and constructing the first source end correction array according to the position sequence; and obtaining a first influence value based on the local factor corresponding to the position sequence in the first source end correction array and the first source end correction array.

In this embodiment, for example, let the influence scores corresponding to the average CPU utilization rate (as a percentage), the memory consumption (as a percentage), the add/delete/modify time and the source end database inflow data be S1, S2, S3 and S4, respectively. If they decrease in that order, that is, S1 > S2 > S3 > S4, the position sequence is S1 in the first position, S2 in the second, S3 in the third and S4 in the fourth. The first source end correction array is then (S1, S2, S3, S4), the corresponding local factors are (α1, α2, α3, α4), and the first influence value = α1·S1 + α2·S2 + α3·S3 + α4·S4.
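The computation in this example can be sketched in a few lines. The sketch assumes that each influence score is the metric value multiplied by its preset weight and that the metrics have already been normalized to comparable scales; the source does not spell out either point, and the function and parameter names are hypothetical.

```python
from typing import Mapping, Sequence


def influence_value(
    metrics: Mapping[str, float],
    weights: Mapping[str, float],
    local_factors: Sequence[float],
) -> float:
    """Build a correction array ordered by influence score (largest first)
    and combine it with the per-position local factors."""
    # Assumed scoring rule: score = metric value * preset weight.
    scores = [metrics[name] * weights[name] for name in metrics]
    # Position sequence: the largest score takes the first position.
    correction_array = sorted(scores, reverse=True)
    if len(local_factors) != len(correction_array):
        raise ValueError("one local factor is required per array position")
    # Influence value = sum of (local factor at position i) * (score at position i).
    return sum(a * s for a, s in zip(local_factors, correction_array))
```

Because the local factors are attached to positions rather than to metrics, the same metric can receive a different factor from one measurement to the next, depending on where its score ranks.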

It should be noted that the corresponding technical features described below work in the same way, so their description is not repeated.

In some embodiments of the present application, establishing a first target end correction array according to the target end average CPU utilization rate and the target end memory consumption, and obtaining a second influence value according to the first target end correction array, includes: obtaining a plurality of influence scores based on the average CPU utilization rate of the target end, the memory consumption of the target end and a second preset weight; determining the position sequence of a first target end correction array based on the magnitude relation among the influence scores, and constructing the first target end correction array according to the position sequence; and obtaining a second influence value based on the local factor corresponding to the position sequence in the first target end correction array and the first target end correction array.

In this embodiment, the total influence value (the synchronization process influence value) is determined based on the first influence value and the second influence value. The specific means is not limited herein; any means that expresses the overall influence falls within the scope of protection of the present application.

In order to improve the reliability of synchronization, in some embodiments of the present application, updating the synchronization rate based on the synchronization process influence value and the data table type includes: determining endpoint values of two sides of the influence of the synchronization process according to the data table type, and selecting a plurality of preset synchronization process influence values based on the endpoint values of the two sides of the influence of the synchronization process; and determining an update coefficient based on the relation between the synchronization process influence value and a plurality of preset synchronization process influence values, and updating the preset synchronization rate based on the update coefficient.

In this embodiment, the three data table types correspond to different pairs of endpoint values for the synchronization process influence, and each pair of endpoint values defines an interval. Specifically:

for example, the endpoint values of the synchronization process influence corresponding to the simple growth data table are A11 and A22, and A11 to A22 is an interval;

the synchronization process influence value is set as A, and an array A0 (A1, A2, A3, A4) of preset synchronization process influence values is set, wherein A1, A2, A3 and A4 are all preset values, and A11 < A1 < A2 < A3 < A4 < A22;

the preset synchronization rate is set as V, and an array F0 (F1, F2, F3, F4) of preset update coefficients is set, wherein F1, F2, F3 and F4 are all preset values, and 0.6 < F1 < F2 < F3 < F4 < 1.4;

an update coefficient is determined according to the relation between the synchronization process influence value and each preset synchronization process influence value, so as to obtain the updated synchronization rate;

if A < A1, the first preset update coefficient F1 is determined as the update coefficient, and the updated synchronization rate is V × F1;

if A1 ≤ A < A2, the second preset update coefficient F2 is determined as the update coefficient, and the updated synchronization rate is V × F2;

if A2 ≤ A < A3, the third preset update coefficient F3 is determined as the update coefficient, and the updated synchronization rate is V × F3;

if A3 ≤ A < A4, the fourth preset update coefficient F4 is determined as the update coefficient, and the updated synchronization rate is V × F4.
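The interval lookup described above maps naturally onto a binary search over the preset influence values. The following is a minimal sketch, with hypothetical function and parameter names. The text does not state what happens when A is at or above A4, so the sketch assumes the rate is left unchanged in that case.

```python
from bisect import bisect_right
from typing import Sequence


def updated_sync_rate(
    rate: float,
    influence: float,
    preset_influences: Sequence[float],
    coefficients: Sequence[float],
) -> float:
    """Pick the update coefficient for the interval containing `influence`
    and return rate * coefficient.

    preset_influences is (A1, A2, A3, A4) in ascending order and
    coefficients is (F1, F2, F3, F4); both would be chosen per table type.
    """
    if len(preset_influences) != len(coefficients):
        raise ValueError("each preset influence value needs one coefficient")
    # bisect_right gives 0 for A < A1, 1 for A1 <= A < A2, and so on.
    index = bisect_right(preset_influences, influence)
    if index >= len(coefficients):
        # Assumption: the source defines no rule for A >= A4; keep the rate.
        return rate
    return rate * coefficients[index]
```

For instance, with preset values (10, 20, 30, 40) and coefficients (0.75, 1.0, 1.25, 1.375), an influence value of exactly 10 falls in the second interval and selects F2.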

It can be appreciated that the corresponding technical features described below work in the same way, so their description is not repeated.

From the above description of the embodiments, it will be clear to those skilled in the art that the present invention may be implemented in hardware, or may be implemented by means of software plus necessary general hardware platforms. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.), and includes several instructions for causing a computer device (may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective implementation scenario of the present invention.

In order to further explain the technical idea of the invention, the technical scheme of the invention is described with specific application scenarios.

The data to be processed falls mainly into the following three types:

1) Simple growth class: such tables have only insert operations. Once data is inserted into the database, it never changes. Such tables mainly include traffic tables that grow in real time, archive tables that grow at scheduled times, and regular tables that grow irregularly or do not grow at all.

2) Full update class: such tables have all of the insert, delete and modify operations. Insertions and deletions may be real-time, scheduled or unscheduled, but modifications are only real-time or unscheduled; there are no scheduled modification operations.

3) Short-term archiving class: such tables undergo batch insert and delete operations at scheduled times, with no modify operations.

Taking a work order table as an example, the records of a given service's work orders may be stored in three tables: an online work order table, which has real-time insert and modify operations and scheduled delete operations, and belongs to the full update class; a long-term archive table, which has only scheduled insert operations, and belongs to the simple growth class; and a short-term archive table, which has only scheduled insert and delete operations. Work order records completed in the online table are deleted from it and inserted into the short-term archive table, while records in the short-term archive table that exceed the time limit are deleted from it and inserted into the long-term archive table.
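As a rough illustration, the three table types can be told apart by the set of operations a table receives. The sketch below uses hypothetical names and deliberately ignores the timing of operations (real-time, scheduled or unscheduled), which the description also mentions but which is not needed to separate the three classes.

```python
from enum import Enum
from typing import AbstractSet


class TableType(Enum):
    SIMPLE_GROWTH = "simple growth"
    FULL_UPDATE = "full update"
    SHORT_TERM_ARCHIVE = "short-term archiving"


def classify_table(operations: AbstractSet[str]) -> TableType:
    """Classify a table from the operations it receives.

    `operations` is a subset of {"insert", "delete", "modify"}.
    """
    if operations == {"insert"}:
        return TableType.SIMPLE_GROWTH       # insert only
    if "modify" in operations:
        return TableType.FULL_UPDATE         # rows can change after insertion
    if operations == {"insert", "delete"}:
        return TableType.SHORT_TERM_ARCHIVE  # batch insert and delete, no modify
    raise ValueError(f"unsupported operation set: {sorted(operations)}")
```

In the work order example, the online table maps to `FULL_UPDATE`, the long-term archive table to `SIMPLE_GROWTH`, and the short-term archive table to `SHORT_TERM_ARCHIVE`.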

The method is divided into two stages: the access of stock data and the access of incremental data.

Access of stock data:

The stock data only needs to be accessed once. Considering the huge volume of stock data, directly adopting the ETL mode would be too inefficient and too unstable when transferring data between remote sites. Therefore, we choose to export the stock data from the source end as files and then import it into the local system where the data warehouse is located. Because the source end and the target end are heterogeneous databases, and the operating rules of the production environment only allow the source end to export dmp files, a small Oracle database needs to be deployed locally at the data warehouse to serve as a data transfer station. The dmp file exported by the source end is imported into this Oracle database, and the full data is then extracted from it into the data warehouse in the ETL mode.

The process is shown in Fig. 2, and the steps are as follows:

1) The source end exports the table to be accessed as a dmp file and simultaneously records the range of the stock data, which serves as the basis for distinguishing the stock data from the incremental data.

2) The exported dmp file is sent by file transfer to the server where the target end transfer database is located.

3) The dmp file is imported into the target end transfer database.

4) The data in the target end transfer database is imported into the MPP data warehouse in the ETL mode.

5) After the import completes successfully, the data in the transfer database is deleted.

6) Steps 1) to 5) are repeated until all of the stock data has been accessed.
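The control flow of steps 1) to 6) can be sketched as a simple loop. All five callables below are hypothetical placeholders for the real export, file transfer, import, ETL and cleanup tools; the sketch shows only the ordering and the rule that the transfer database is cleared only after a successful import. Recording the stock data range in step 1) is left to the export callable.

```python
from typing import Callable, Iterable, List


def access_stock_data(
    tables: Iterable[str],
    export_dump: Callable[[str], str],        # step 1: table name -> dump file path
    transfer_file: Callable[[str], str],      # step 2: remote path -> local path
    import_dump: Callable[[str], None],       # step 3: load dump into transfer database
    etl_to_warehouse: Callable[[str], None],  # step 4: transfer database -> warehouse
    clear_transfer_db: Callable[[str], None], # step 5: delete transferred data
) -> List[str]:
    """Run steps 1) to 5) for each table and return the tables completed."""
    completed: List[str] = []
    for table in tables:                      # step 6: repeat for every table
        dump_path = export_dump(table)
        local_path = transfer_file(dump_path)
        import_dump(local_path)
        # If the ETL step raises, the cleanup below is skipped, so the
        # transfer database still holds the data for a retry.
        etl_to_warehouse(table)
        clear_transfer_db(table)
        completed.append(table)
    return completed
```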

ETL is a mode of data extraction, transformation and loading. First, in the data extraction stage, data is read from the source end using a reading method suited to the source end database; commonly used structured databases such as Oracle and MySQL are generally read via JDBC. After the data is read, it is processed and converted as appropriate in the data transformation stage so that it meets the storage requirements of the target end. Finally, in the data loading stage, the processed data is loaded into the target end database.
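The three ETL stages can be illustrated with a small, self-contained sketch. Here Python's built-in `sqlite3` module stands in for both the source end and the target end database, purely so that the example runs anywhere; the table names, columns and the trivial transformation are assumptions and are not taken from the described system, which reads Oracle and MySQL via JDBC.

```python
import sqlite3


def run_etl(source: sqlite3.Connection, target: sqlite3.Connection) -> int:
    """Copy rows from a source table to a target table in three stages."""
    # Extract: read the rows from the source end.
    rows = source.execute("SELECT id, status FROM work_order").fetchall()
    # Transform: normalize the status text to suit the target end.
    converted = [(row_id, status.strip().upper()) for row_id, status in rows]
    # Load: write the converted rows to the target end in one transaction.
    with target:
        target.executemany(
            "INSERT INTO work_order_dw (id, status) VALUES (?, ?)", converted
        )
    return len(converted)


def demo() -> list:
    """Set up two in-memory databases, run the ETL and return the loaded rows."""
    source = sqlite3.connect(":memory:")
    target = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE work_order (id INTEGER PRIMARY KEY, status TEXT)")
    source.executemany(
        "INSERT INTO work_order VALUES (?, ?)", [(1, " open "), (2, "closed")]
    )
    target.execute("CREATE TABLE work_order_dw (id INTEGER PRIMARY KEY, status TEXT)")
    run_etl(source, target)
    return target.execute(
        "SELECT id, status FROM work_order_dw ORDER BY id"
    ).fetchall()
```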

The access of incremental data must take the access frequency into account. Transactional queries require real-time or near-real-time data (on the order of minutes), but the time range of the queried data is small, typically no more than 1 year. For offline computation scenarios such as BI analysis, report computation, query analysis, data mining, and business prediction, incremental data only needs to arrive within the computation interval of the offline job, which is usually measured in days or months; however, the time range of the queried data is larger and can reach the full data set. To meet both requirements, MySQL is used as a cache database into which incremental data is accessed before it reaches the MPP data warehouse, and 1 year of data is cached in MySQL to support transactional queries. Data is synchronized from Oracle to MySQL using the OGG mode, and incremental data is then extracted from MySQL into the MPP data warehouse using the ETL mode, at the minimum time interval required by the offline computing scenarios.
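
The two timing rules in this paragraph (about one year of cached data, and extraction at the minimum interval of the offline scenarios) can be sketched as follows. The names `CACHE_RETENTION`, `extraction_interval` and `prune_cache`, the 365-day approximation of one year, and the scenario intervals used in the example are assumptions for illustration and do not appear in the application.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

CACHE_RETENTION = timedelta(days=365)  # assumed: "1 year" approximated as 365 days

CachedRow = Tuple[datetime, str]  # (row timestamp, payload); hypothetical row shape


def extraction_interval(offline_intervals: Dict[str, timedelta]) -> timedelta:
    """Pick the ETL interval as the smallest interval any offline scenario needs."""
    return min(offline_intervals.values())


def prune_cache(rows: List[CachedRow], now: datetime) -> List[CachedRow]:
    """Keep only rows that fall inside the cache retention window."""
    cutoff = now - CACHE_RETENTION
    return [row for row in rows if row[0] >= cutoff]
```

For example, with a daily report and a monthly data-mining job, the extraction interval would be one day, so the warehouse is never staler than its most demanding offline consumer requires.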

The incremental data process is shown in Fig. 3, and the steps are as follows:

1) Data in the source Oracle database is synchronously copied to MySQL in the OGG mode, and an add/delete/modify identifier (marking whether a row was inserted, deleted, or updated) and a timestamp field are added during the OGG process.

2) The data is periodically parsed and converted in the ETL mode and stored in the MPP data warehouse.
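
The role of the identifier and timestamp fields in step 2) can be sketched as follows. This is a hedged illustration, not the parsing logic of the application: the `Change` record, the single-letter operation codes `I`/`U`/`D`, integer timestamps, and the watermark-based window are all assumptions, and a dictionary stands in for a warehouse table.

```python
from typing import Dict, List, NamedTuple


class Change(NamedTuple):
    key: int
    value: str
    op: str  # assumed codes: 'I' insert, 'U' update, 'D' delete (the added identifier)
    ts: int  # the added timestamp field (epoch seconds in this sketch)


def apply_increment(warehouse: Dict[int, str], cache: List[Change],
                    last_ts: int, until_ts: int) -> int:
    """Apply cached changes with last_ts < ts <= until_ts in timestamp order.

    Returns the new watermark, to be passed as last_ts on the next periodic run.
    """
    window = sorted((c for c in cache if last_ts < c.ts <= until_ts),
                    key=lambda c: c.ts)
    for change in window:
        if change.op == "D":
            warehouse.pop(change.key, None)       # delete the row if it exists
        else:
            warehouse[change.key] = change.value  # inserts and updates both upsert
    return until_ts
```

Replaying changes in timestamp order is what makes the identifier sufficient: a row that is inserted and later deleted within the same window ends up absent from the warehouse, as it is in the source.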

It will be appreciated that the specific means or databases described above may be adapted or modified as appropriate to the particular situation.

Finally, it should be noted that the above embodiments only illustrate the technical solution of the present application and do not limit it. Although the present application has been described in detail with reference to the foregoing embodiments, one of ordinary skill in the art will appreciate that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims

1. A fast data source access method, applied to a system comprising a source database, a transit database, a cache database and a target database, characterized by comprising the following steps:

2. The method of claim 1, wherein after acquiring the synchronization information, the method further comprises:

3. The method of claim 1, wherein after obtaining the resolved access information, the method further comprises:

the analysis access information comprises all data analysis completion time;

4. The method of claim 2, wherein updating the synchronization rate in real time based on the synchronization information comprises:

5. The method of claim 4, wherein establishing a first source correction array based on the source average CPU utilization, the source memory consumption, the addition and deletion time, and the source database inflow data, and obtaining a first impact value based on the first source correction array, comprises:

6. The method of claim 4, wherein establishing a first target-side correction array based on the target-side average CPU utilization and the target-side memory consumption, and obtaining a second impact value based on the first target-side correction array, comprises:

7. The method of claim 5 or 6, wherein updating the synchronization rate based on the synchronization process impact value and the data table type comprises:

8. The method of claim 1, wherein updating the access frequency in real time based on the parsed access information comprises:

9. The method of claim 8 wherein establishing a second source correction array based on the source average CPU utilization, the source memory consumption, and the source database inflow data and obtaining a third impact value, and establishing a second target correction array based on the target average CPU utilization and the target memory consumption and obtaining a fourth impact value, comprises:

10. The method of claim 9, wherein updating the access frequency based on resolving the access procedure impact value and the data table type comprises: