CN117891865A - Multi-source heterogeneous data acquisition method, system, computer equipment and storage medium - Google Patents

Multi-source heterogeneous data acquisition method, system, computer equipment and storage medium

Info

Publication number
CN117891865A
Authority
CN
China
Prior art keywords
data
source
data source
caching
accessed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311769274.2A
Other languages
Chinese (zh)
Inventor
周辉
彭宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Wisesoft System Integration Co ltd
Original Assignee
Sichuan Wisesoft System Integration Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Wisesoft System Integration Co ltd
Priority to CN202311769274.2A
Publication of CN117891865A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of computers, and in particular to a data acquisition method, system, device and storage medium for multi-source heterogeneous databases. The method comprises the following steps: judging whether the accessed data source is a single data source; if it is a single data source, segmenting the data source and warehousing it; if there are multiple data sources, measuring and calculating the data capacity of the accessed data sources to obtain a data capacity measurement result; acquiring the server storage space; configuring a plurality of data caching methods; selecting, according to the data capacity measurement result and the server storage space, a corresponding data caching method to cache the accessed data sources; and performing association acquisition on the cached multi-source data, then segmenting and warehousing the data. By combining multiple caching modes and dynamically allocating a caching mode to the accessed data sources, the method improves the real-time performance and efficiency of data processing and avoids data loss.

Description

Multi-source heterogeneous data acquisition method, system, computer equipment and storage medium
Technical Field
The invention relates to the technical field of computers, in particular to a data acquisition method, a system, a device and a storage medium of a multi-source heterogeneous database.
Background
In the construction of a data center, data acquisition is an indispensable step, and associated access across multi-source heterogeneous databases is one of its most common scenarios. To associate data across such databases, the data of every source to be associated must first be cached in the same storage, so that the multi-source data becomes single-source data on which the association acquisition can then be performed. Two approaches are currently used. The first is memory-based data caching: the data of each source is extracted into memory and cached there, and the association query is then run on the in-memory data. The second is disk-based data caching: the data of each source is extracted to disk and stored as files, and the association query is then run on the on-disk data.
Both approaches have drawbacks. First, memory-based caching places extremely high demands on memory hardware; when the cached data volume is very large, memory becomes a bottleneck and may overflow, losing data. Its persistence is also weak: when the server loses power or new data must be read, the data has to be fetched again from the heterogeneous sources, which increases network overhead and reduces the effectiveness of data processing. Second, disk-based caching cannot deliver efficient, real-time reads and writes, and frequent reads and writes also increase system I/O pressure.
In view of this, the present application is specifically proposed.
Disclosure of Invention
The invention aims to provide a data acquisition method, system, device and storage medium for multi-source heterogeneous databases, solving the problems that existing methods of associated access between multi-source heterogeneous databases are prone to data loss, slow to read and poor in real-time performance.
The invention is realized by the following technical scheme:
In a first aspect, a multi-source heterogeneous data collection method is provided, comprising the following steps: judging whether the accessed data source is a single data source; if it is a single data source, segmenting the data source and warehousing it; if there are multiple data sources, measuring and calculating the data capacity of the accessed data sources to obtain a data capacity measurement result; acquiring the server storage space; configuring a plurality of data caching methods; selecting, according to the data capacity measurement result and the server storage space, a corresponding data caching method to cache the accessed data sources; and performing association acquisition on the cached multi-source data, then segmenting and warehousing the data.
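The flow of the first aspect can be sketched in Python as below. This is a minimal illustration rather than the claimed implementation: the function `collect` and every callable passed to it (`estimate_capacity`, `get_storage_space`, `select_cache`, `cache_sources`, `associate`, `segment_and_store`) are hypothetical names introduced here to stand in for the corresponding method steps.

```python
from typing import Callable, Sequence


def collect(
    sources: Sequence[str],
    estimate_capacity: Callable[[Sequence[str]], int],
    get_storage_space: Callable[[], int],
    select_cache: Callable[[int, int], str],
    cache_sources: Callable[[Sequence[str], str], object],
    associate: Callable[[object], object],
    segment_and_store: Callable[[object], None],
) -> str:
    """Run the collection flow; return 'direct' or the chosen cache method name.

    All parameters are hypothetical stand-ins for the steps in the text.
    """
    if not sources:
        raise ValueError("at least one data source is required")

    if len(sources) == 1:
        # Single data source: no caching, segment and warehouse directly.
        segment_and_store(sources[0])
        return "direct"

    # Multiple data sources: estimate capacity, then pick a cache method
    # from the estimate and the server storage space.
    capacity = estimate_capacity(sources)
    storage = get_storage_space()
    method = select_cache(capacity, storage)

    # Cache every source in one place, associate, then segment and warehouse.
    cached = cache_sources(sources, method)
    associated = associate(cached)
    segment_and_store(associated)
    return method
```

Passing the steps in as callables keeps the control flow separate from any particular cache backend, which is a design choice of this sketch and not something the text prescribes.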
Further, before measuring and calculating the data capacity of the accessed data source, the method comprises the following steps: splitting the accessed data source into a relational data source and a non-relational data source according to the type of the data source; the method for measuring and calculating the data capacity of the accessed data source comprises the following steps: the data volume size and the character size of each field of the relational data source are counted, and the data volume size and the character size of each field of the non-relational data source are counted.
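As an illustration of the splitting step, the sketch below partitions accessed sources by their declared type. The text does not say how a source type is recognised, so the `RELATIONAL_TYPES` set, the `(name, type)` tuple layout and the function name `split_sources` are all assumptions made for this example.

```python
from typing import Iterable, List, Tuple

# Hypothetical set of type labels treated as relational; anything else is
# treated as non-relational in this sketch.
RELATIONAL_TYPES = frozenset({"mysql", "postgresql", "oracle"})


def split_sources(
    sources: Iterable[Tuple[str, str]],
) -> Tuple[List[str], List[str]]:
    """Split (name, type) pairs into relational and non-relational names."""
    relational: List[str] = []
    non_relational: List[str] = []
    for name, source_type in sources:
        if source_type.lower() in RELATIONAL_TYPES:
            relational.append(name)
        else:
            non_relational.append(name)
    return relational, non_relational
```

Each group can then be measured separately, field by field, as the paragraph above describes.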
Further, the plurality of data caching methods include: a memory-based data caching method, a columnar-database-based data caching method and a distributed-file-based data caching method.
Further, selecting a corresponding data caching method to cache the accessed data source comprises the following steps: when the data capacity measurement result is less than one twentieth of the storage space of the server, selecting the memory-based data caching method to cache the accessed data source; when the data capacity measurement result is more than one twentieth and less than one tenth of the storage space of the server, selecting the columnar-database-based data caching method to cache the accessed data source; when the data capacity measurement result is more than one tenth of the storage space of the server, selecting the distributed-file-based data caching method to cache the accessed data source.
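The threshold rule above can be sketched as a small Python function. The names `CacheMethod` and `select_cache_method` are hypothetical. The text does not say which method applies when the estimate is exactly one twentieth or exactly one tenth of the storage space, so this sketch assumes the larger-capacity method is chosen at each boundary.

```python
from enum import Enum


class CacheMethod(Enum):
    MEMORY = "memory"
    COLUMNAR_DB = "column-oriented database"
    DISTRIBUTED_FILE = "distributed file"


def select_cache_method(estimated_bytes: int, storage_bytes: int) -> CacheMethod:
    """Pick a cache method from the capacity estimate and server storage space."""
    if storage_bytes <= 0:
        raise ValueError("storage_bytes must be positive")
    if estimated_bytes < 0:
        raise ValueError("estimated_bytes must not be negative")

    # Integer comparisons avoid floating-point rounding at the thresholds.
    if estimated_bytes * 20 < storage_bytes:
        # Below one twentieth of the storage space.
        return CacheMethod.MEMORY
    if estimated_bytes * 10 < storage_bytes:
        # From one twentieth up to (not including) one tenth.
        # Assumption: exactly one twentieth falls in this branch.
        return CacheMethod.COLUMNAR_DB
    # One tenth or more. Assumption: exactly one tenth falls in this branch.
    return CacheMethod.DISTRIBUTED_FILE
```

On a real server, the free space reported by Python's `shutil.disk_usage(path).free` is one possible source for `storage_bytes`. Whether the method intends free or total space is not stated in the text.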
In a second aspect, a multi-source heterogeneous data acquisition system is provided, comprising: an analysis control module, a capacity measuring and calculating module, a memory acquisition module, a method configuration module, a data caching module, an association acquisition module, a segmentation and warehousing module and a data splitting module. The analysis control module is used for judging whether the accessed data source is a single data source; if it is a single data source, the analysis control module controls the segmentation and warehousing module to work; if there are multiple data sources, it controls the capacity measuring and calculating module to work. The capacity measuring and calculating module is used for measuring and calculating the data capacity of the accessed data sources to obtain a data capacity measurement result. The memory acquisition module is used for acquiring the storage space of the server. The method configuration module is used for configuring a plurality of data caching methods. The data caching module is used for selecting, according to the data capacity measurement result and the server storage space, a corresponding data caching method to cache the accessed data sources. The association acquisition module is used for performing association acquisition on the cached multi-source data. The segmentation and warehousing module is used for segmenting and warehousing the data source. The data splitting module is used for splitting the accessed data sources into relational data sources and non-relational data sources according to the type of the data source.
Further, the capacity measurement module includes: a first statistical unit and a second statistical unit. The first statistics unit is used for counting the data size of the relational data source and the character size of each field. The second statistics unit is used for counting the data volume size and the character size of each field of the non-relational data source.
Further, the method configuration module includes: a first method configuration unit, a second method configuration unit and a third method configuration unit. The first method configuration unit is used for configuring the memory-based data caching method. The second method configuration unit is used for configuring the columnar-database-based data caching method. The third method configuration unit is used for configuring the distributed-file-based data caching method.
Further, the data caching module includes: a first data caching unit, a second data caching unit and a third data caching unit. The first data caching unit is used for selecting the memory-based data caching method to cache the accessed data source when the data capacity measurement result is less than one twentieth of the storage space of the server. The second data caching unit is used for selecting the columnar-database-based data caching method to cache the accessed data source when the data capacity measurement result is more than one twentieth and less than one tenth of the storage space of the server. The third data caching unit is used for selecting the distributed-file-based data caching method to cache the accessed data source when the data capacity measurement result is more than one tenth of the storage space of the server.
In a third aspect, a computer device is provided, comprising a memory, a processor and a transceiver in communication connection in sequence, wherein the memory is configured to store a computer program, the transceiver is configured to send and receive messages, and the processor is configured to read the computer program and perform the multi-source heterogeneous data collection method according to the first aspect.
In a fourth aspect, a computer readable storage medium is provided, on which instructions are stored which, when run on a computer, perform the multi-source heterogeneous data collection method according to the first aspect.
Compared with the prior art, the invention has the following advantages and beneficial effects: the type of the data source is identified dynamically, and different data caching modes are allocated dynamically according to the relation between the data capacity of the data source and the memory space of the server. The columnar database cache facilitates optimized, compressed storage of the data structure and can improve caching efficiency when the accessed data sources hold very large volumes of data; the distributed file cache supports large-volume storage, is not affected by restarts of the application process, prevents data loss, does not occupy the memory space of the application process, and keeps the cached data consistent; the memory cache can be used for accessed data sources with a smaller data capacity, which improves the caching rate. By combining multiple caching modes and dynamically allocating a caching mode to the accessed data sources, the method improves the real-time performance and efficiency of data processing and avoids data loss.
Drawings
In order to more clearly illustrate the technical solutions of the exemplary embodiments of the present invention, the drawings that are needed in the examples will be briefly described below, it being understood that the following drawings only illustrate some examples of the present invention and therefore should not be considered as limiting the scope, and that other related drawings may be obtained from these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a multi-source heterogeneous data collection method according to an embodiment of the present invention.
Detailed Description
For the purpose of making apparent the objects, technical solutions and advantages of the present invention, the present invention will be further described in detail with reference to the following examples and the accompanying drawings, wherein the exemplary embodiments of the present invention and the descriptions thereof are for illustrating the present invention only and are not to be construed as limiting the present invention.
Examples
To address the problems that existing methods of associated access between multi-source heterogeneous databases are prone to data loss, slow to read and poor in real-time performance, the first aspect of this embodiment provides a multi-source heterogeneous data acquisition method. As shown in Fig. 1, the method comprises the following steps:
Step 1: judging whether the accessed data source is a single data source. If the accessed data source is a single data source, it does not need to be cached; the data is acquired directly, segmented and warehoused. If there are multiple accessed data sources, step 2 is performed on them.
It should be noted that, before the data source is accessed in step 1, an access program is configured according to the acquisition requirement, where the access program includes information such as access data source configuration, conversion component configuration, output source configuration, etc.
Step 2: splitting the accessed data sources into relational data sources and non-relational data sources according to the type of the data source.
Step 3: measuring and calculating the data capacity of the accessed data sources to obtain a data capacity measurement result; that is, the data volume and the character size of each field are counted separately for the relational and the non-relational data sources. For example, take a student table named Student with 5 fields: id (int, 4), name (varchar, 10), age (int, 4), sex (char, 1) and email (varchar, 20). When the table holds 1,000,000 rows, the estimated storage space is about 99 MB, computed as follows: the Id field occupies 1,000,000 × 4 bytes = 4,000,000 bytes; the Name field occupies 1,000,000 × 10 characters × 3 bytes per character = 30,000,000 bytes; the Age field occupies 1,000,000 × 4 bytes = 4,000,000 bytes; the Sex field occupies 1,000,000 × 1 byte = 1,000,000 bytes; the Email field occupies 1,000,000 × 20 characters × 3 bytes per character = 60,000,000 bytes; together the five fields occupy 99,000,000 bytes.
Step 4: acquiring server configuration information, for example the size of the remaining memory and the size of the storage space of the server. This information is combined with the data capacity measurement result obtained in step 3 to dynamically allocate the caching method in the subsequent steps.
Step 5: configuring a memory-based data caching method, a columnar-database-based data caching method and a distributed-file-based data caching method.
Step 6: selecting, according to the data capacity measurement result and the server storage space, a corresponding data caching method to cache the accessed data sources. Specifically:
Step 6.1: when the data capacity measurement result is less than one twentieth of the storage space of the server, selecting the memory-based data caching method to cache the accessed data sources.
Step 6.2: when the data capacity measurement result is more than one twentieth and less than one tenth of the storage space of the server, selecting the columnar-database-based data caching method to cache the accessed data sources.
Step 6.3: when the data capacity measurement result is more than one tenth of the storage space of the server, selecting the distributed-file-based data caching method to cache the accessed data sources.
Step 7: performing association acquisition on the cached multi-source data, then segmenting and warehousing the data.
Step 8: releasing the cache after the data has been warehoused.
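The per-field arithmetic of the Student example in step 3 can be reproduced with the short Python sketch below. It takes the field widths and the 3 bytes per character from that example and assumes, as the example does for the Sex field, that a `char` column costs 1 byte per character. The names `FIELDS`, `field_bytes` and `table_bytes` are hypothetical.

```python
from typing import Iterable, Tuple

ROWS = 1_000_000          # row count used in the Student example
BYTES_PER_CHAR = 3        # per-character cost the example uses for varchar

# (field name, SQL type, declared width) taken from the Student example.
FIELDS = [
    ("id", "int", 4),
    ("name", "varchar", 10),
    ("age", "int", 4),
    ("sex", "char", 1),
    ("email", "varchar", 20),
]


def field_bytes(sql_type: str, width: int, rows: int = ROWS) -> int:
    """Estimated bytes for one column: rows times bytes per row value."""
    per_row = width * BYTES_PER_CHAR if sql_type == "varchar" else width
    return rows * per_row


def table_bytes(fields: Iterable[Tuple[str, str, int]], rows: int = ROWS) -> int:
    """Sum the per-column estimates for a whole table."""
    return sum(field_bytes(sql_type, width, rows) for _, sql_type, width in fields)


if __name__ == "__main__":
    for name, sql_type, width in FIELDS:
        print(f"{name}: {field_bytes(sql_type, width):,} bytes")
    print(f"total: {table_bytes(FIELDS):,} bytes")
```

Under these assumptions the five fields sum to 99,000,000 bytes. The estimate is an upper bound on the raw field data only, since it ignores row overhead, indexes and compression.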
Corresponding to the multi-source heterogeneous data acquisition method provided in the first aspect, a second aspect of this embodiment provides a multi-source heterogeneous data acquisition system, comprising: an analysis control module, a capacity measuring and calculating module, a memory acquisition module, a method configuration module, a data caching module, an association acquisition module, a segmentation and warehousing module and a data splitting module. The analysis control module is used for judging whether the accessed data source is a single data source; if it is a single data source, the analysis control module controls the segmentation and warehousing module to work; if there are multiple data sources, it controls the capacity measuring and calculating module to work. The capacity measuring and calculating module is used for measuring and calculating the data capacity of the accessed data sources to obtain a data capacity measurement result. The memory acquisition module is used for acquiring the storage space of the server. The method configuration module is used for configuring a plurality of data caching methods. The data caching module is used for selecting, according to the data capacity measurement result and the server storage space, a corresponding data caching method to cache the accessed data sources. The association acquisition module is used for performing association acquisition on the cached multi-source data. The segmentation and warehousing module is used for segmenting and warehousing the data source. The data splitting module is used for splitting the accessed data sources into relational data sources and non-relational data sources according to the type of the data source.
Wherein, the capacity measurement module includes: a first statistical unit and a second statistical unit. The first statistics unit is used for counting the data volume size of the relational data source and the character size of each field. The second statistics unit is used for counting the data volume size and the character size of each field of the non-relational data source.
The method configuration module comprises: a first method configuration unit, a second method configuration unit and a third method configuration unit. The first method configuration unit is used for configuring the memory-based data caching method. The second method configuration unit is used for configuring the columnar-database-based data caching method. The third method configuration unit is used for configuring the distributed-file-based data caching method.
The data caching module comprises: a first data caching unit, a second data caching unit and a third data caching unit. The first data caching unit is used for selecting the memory-based data caching method to cache the accessed data source when the data capacity measurement result is less than one twentieth of the storage space of the server. The second data caching unit is used for selecting the columnar-database-based data caching method to cache the accessed data source when the data capacity measurement result is more than one twentieth and less than one tenth of the storage space of the server. The third data caching unit is used for selecting the distributed-file-based data caching method to cache the accessed data source when the data capacity measurement result is more than one tenth of the storage space of the server.
A third aspect of the present embodiment provides a computer device, including a memory, a processor and a transceiver, which are sequentially communicatively connected, where the memory is configured to store a computer program, the transceiver is configured to send and receive a message, and the processor is configured to read the computer program and perform the multi-source heterogeneous data collection method according to the first aspect.
A fourth aspect of the present embodiment provides a computer readable storage medium storing instructions comprising the multi-source heterogeneous data collection method as described in the first aspect or any of the possible designs in the first aspect, i.e. the computer readable storage medium has instructions stored thereon which, when run on a computer, perform the multi-source heterogeneous data collection method as described in the first aspect or any of the possible designs in the first aspect. The computer readable storage medium refers to a carrier for storing data, and may include, but is not limited to, a floppy disk, an optical disk, a hard disk, a flash Memory, and/or a Memory Stick (Memory Stick), where the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
For the working process, working details and technical effects of the computer readable storage medium provided in the fourth aspect of this embodiment, reference may be made to the multi-source heterogeneous data acquisition method described in the first aspect or any possible design of the first aspect; they are not repeated here.
A fifth aspect of the present embodiments provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the multi-source heterogeneous data collection method as described in the first aspect or any of the possible designs in the first aspect. Wherein the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus.
The foregoing embodiments are provided to illustrate the general principles of the invention. They are specific embodiments only and are not intended to limit the scope of protection of the invention; any modifications, equivalent substitutions, improvements and the like made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (10)

1. The multi-source heterogeneous data acquisition method is characterized by comprising the following steps of:
judging whether the accessed data source is a single data source or not; if the data source is a single data source, splitting the data source and warehousing; if the data source is a plurality of data sources, measuring and calculating the data capacity of the accessed data sources to obtain a data capacity measuring and calculating result;
acquiring a server storage space;
configuring a plurality of data caching methods;
selecting a corresponding data caching method to cache the accessed data source according to the data capacity measuring and calculating result and the server storage space;
and carrying out association acquisition on the cached source data, and splitting and warehousing the data source.
2. The method for multi-source heterogeneous data collection of claim 1, wherein,
before measuring and calculating the data capacity of the accessed data source, the method comprises the following steps:
splitting the accessed data source into a relational data source and a non-relational data source according to the type of the data source;
the method for measuring and calculating the data capacity of the accessed data source comprises the following steps: the data volume size and the character size of each field of the relational data source are counted, and the data volume size and the character size of each field of the non-relational data source are counted.
3. The multi-source heterogeneous data collection method according to claim 1 or 2, wherein the plurality of data caching methods comprise: a memory-based data caching method, a columnar-database-based data caching method and a distributed-file-based data caching method.
4. A multi-source heterogeneous data collection method according to claim 3, wherein selecting a corresponding data caching method to cache the accessed data source comprises the steps of:
when the data capacity measuring and calculating result is less than one twentieth of the storage space of the server, selecting a data caching method based on a memory to cache the accessed data source;
when the data capacity measurement result is more than one twentieth of the storage space of the server and less than one tenth of the storage space of the server, selecting a data caching method based on a columnar database to cache the accessed data source;
when the data capacity measuring and calculating result is more than one tenth of the storage space of the server, a data caching method based on the distributed file is selected to cache the accessed data source.
5. A multi-source heterogeneous data acquisition system, comprising:
the analysis control module is used for judging whether the accessed data source is a single data source or not; if the data source is a single data source, controlling the segmentation and warehousing module to work; if the data source is a plurality of data sources, controlling the capacity measuring and calculating module to work;
the capacity measuring and calculating module is used for measuring and calculating the data capacity of the accessed data source to obtain a data capacity measuring and calculating result;
the memory acquisition module is used for acquiring the storage space of the server;
the method configuration module is used for configuring various data caching methods;
the data caching module is used for selecting a corresponding data caching method to cache the accessed data source according to the data capacity measuring and calculating result and the server storage space;
the association acquisition module is used for carrying out association acquisition on the cached multi-source data;
and the segmentation and warehousing module is used for segmenting and warehousing the data source.
6. A multi-source heterogeneous data collection system according to claim 5, wherein,
the system further comprises:
the data splitting module is used for splitting the accessed data source into a relational data source and a non-relational data source according to the type of the data source;
the capacity measurement module comprises:
the first statistics unit is used for counting the data size of the relational data source and the character size of each field;
and the second statistical unit is used for counting the data volume size of the non-relational data source and the character size of each field.
7. The multi-source heterogeneous data collection system of claim 5 or 6, wherein the method configuration module comprises:
a first method configuration unit for configuring a memory-based data caching method;
a second method configuration unit for configuring a data caching method based on the columnar database;
and the third method configuration unit is used for configuring a data caching method based on the distributed file.
8. The multi-source heterogeneous data collection system of claim 7, wherein the data caching module comprises:
the first data caching unit is used for selecting a memory-based data caching method to cache the data source when the data capacity measurement result is less than one twentieth of the storage space of the server;
the second data caching unit is used for selecting a data caching method based on the columnar database to cache the accessed data source when the data capacity measurement result is more than one twentieth of the storage space of the server and less than one tenth of the storage space of the server;
and the third data caching unit is used for selecting a data caching method based on the distributed file to cache the accessed data source when the data capacity measuring and calculating result is more than one tenth of the storage space of the server.
9. A computer device comprising a memory, a processor and a transceiver in communication connection in sequence, wherein the memory is adapted to store a computer program, the transceiver is adapted to receive and transmit messages, and the processor is adapted to read the computer program and to perform the multi-source heterogeneous data collection method according to any of claims 1-4.
10. A computer readable storage medium having instructions stored thereon which, when executed on a computer, perform the multi-source heterogeneous data collection method of any of claims 1 to 4.
CN202311769274.2A 2023-12-20 2023-12-20 Multi-source heterogeneous data acquisition method, system, computer equipment and storage medium Pending CN117891865A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311769274.2A CN117891865A (en) 2023-12-20 2023-12-20 Multi-source heterogeneous data acquisition method, system, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311769274.2A CN117891865A (en) 2023-12-20 2023-12-20 Multi-source heterogeneous data acquisition method, system, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117891865A (en) 2024-04-16

Family

ID=90643264

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311769274.2A Pending CN117891865A (en) 2023-12-20 2023-12-20 Multi-source heterogeneous data acquisition method, system, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117891865A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination