CN109426438B - Real-time big data mirror image storage method and device - Google Patents

Info

Publication number
CN109426438B
Authority
CN
China
Prior art keywords
data
mirror image
storage
cache
smaller
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710771908.6A
Other languages
Chinese (zh)
Other versions
CN109426438A (en
Inventor
涂锋
尹启禄
顾学伟
王建宏
刘钰柏
黄志豪
刘忱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Group Guangdong Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Guangdong Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Group Guangdong Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN201710771908.6A priority Critical patent/CN109426438B/en
Publication of CN109426438A publication Critical patent/CN109426438A/en
Application granted granted Critical
Publication of CN109426438B publication Critical patent/CN109426438B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608 Saving storage space on storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G06F3/0641 De-duplication techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the invention provide a real-time big data mirror image storage method and device. The method splits the original data acquired from a real-time data source, applies mirror image processing (such as rearranging, screening and deleting) to the split data according to actual service requirements, and finally stores the mirrored data, thereby reducing data redundancy and improving data availability. In addition, the method checks both the cache data produced by mirror image processing and the storage data finally written to the specified path, and completes storage only when the error between the record counts is smaller than a preset threshold. This improves the accuracy of the stored data and provides strong support for later data analysis.

Description

Real-time big data mirror image storage method and device
Technical Field
The embodiment of the invention relates to the technical field of software, in particular to a real-time big data mirror image storage method and device.
Background
With the rapid development of internet technology, big data has become a hot topic. For operators and large internet companies in particular, data grows by petabytes every day. In response to calls from the Party and the government, the enterprises concerned are vigorously developing the big data application industry and building their own big data analysis and processing platforms to store, analyze and apply big data. In practical big data applications, data acquisition is highly real-time: for example, the latency of an operator's signaling data acquisition and of an internet company's log data acquisition ranges from minutes down to seconds. Such real-time data can improve the accuracy and quality of big data applications with high real-time requirements, such as urban heat maps. How to better store and analyze the acquired data, shorten the time from acquisition through storage to application, and ensure the accuracy of the data is therefore a problem to be solved urgently.
The currently popular big data platforms are mainly based on the open-source Hadoop platform, and big data is stored in the Hadoop Distributed File System (HDFS). For real-time big data, the data is generally received, serialized and compressed, and then stored sequentially in a local file system as small files. After the absolute position of a small file is determined, its relative position is recalculated so that it can be appended to a large file, which preserves file integrity while keeping the large file splittable; the small files are then appended to HDFS asynchronously.
However, in the process of implementing the invention, the inventors found that the existing scheme has the following problems:
1. The data redundancy is large. After data storage is completed, subsequent data analysis applications must perform a large amount of preprocessing on the original data to remove useless information before the data can be analyzed, which consumes a large amount of computing resources;
2. The likelihood of data loss is high. Because the data content is not checked after it is stored, part of the data may be lost without being detected, which makes later data analysis inaccurate.
Disclosure of Invention
The embodiment of the invention provides a real-time big data mirror image storage method and device, which are used for overcoming the defects of large data redundancy and easy data loss of the existing big data storage method.
In a first aspect, an embodiment of the present invention provides a real-time big data mirror storage method, including:
receiving a real-time data source;
performing row-column splitting on original data in the real-time data source to obtain the number of original data records of the original data; performing mirror image processing on the original data according to a preset mirror image algorithm to obtain a mirrored data result, storing the data result into a cache variable, and recording the number of cache data records in the cache variable;
if the size of the cache variable reaches a set value, judging whether the error between the number of original data records and the number of cache data records is smaller than a preset threshold;
if the error is smaller than the preset threshold, storing the cache data in the cache variable into a storage file according to a specified configuration path, and recording the number of stored data records in the storage file;
judging whether the error between the number of cache data records and the number of stored data records is smaller than a preset threshold; and if the error is smaller than the preset threshold, sending the storage file to an external distributed storage system for storage.
Optionally, the mirroring processing on the original data according to a preset mirroring algorithm to obtain a mirrored data result includes:
loading a data mirror configuration table;
and mirroring the column data of each row in the original data according to the column data mirror mapping relation configured in the configuration table to obtain a mirrored data result.
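As an illustration of loading the configuration table, the following minimal Python sketch parses entries in the `IN1:0,2,1,4,3,5,7,8,9` form used in the detailed description. The function names `parse_mirror_config` and `dropped_columns` are hypothetical, and the assumption that the table holds one entry per text line is not stated in the source.

```python
def parse_mirror_config(lines):
    """Parse entries such as 'IN1:0,2,1,4,3,5,7,8,9' into {interface: [column indexes]}."""
    table = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines (one entry per line is an assumption)
        name, sep, columns = line.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in mirror config line: {line!r}")
        table[name.strip()] = [int(c) for c in columns.split(",")]
    return table


def dropped_columns(mapping, column_count):
    """Return the source column indexes that the mapping filters out."""
    return [i for i in range(column_count) if i not in mapping]


config = parse_mirror_config(["IN1:0,2,1,4,3,5,7,8,9"])
print(config["IN1"])                       # column order after mirroring
print(dropped_columns(config["IN1"], 10))  # columns removed by the mapping
```

With ten source columns, the mapping keeps nine of them in a new order and drops index 6, which matches the removal of "A7" in the worked example of the detailed description.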
Optionally, the method further comprises:
acquiring the resource condition of the local system, and calculating the current resource load value of the local system;
if the resource load value of the local system is greater than a first threshold, reducing the data mirroring processing queues;
if the resource load value of the local system is smaller than a second threshold, adding data mirroring processing queues;
wherein the first threshold is greater than the second threshold.
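The two-threshold rule can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the patented implementation: the function name `adjust_queues`, the step of one queue per adjustment, the queue bounds and the sample load values are all hypothetical. The text only specifies that queues are reduced above the first threshold and added below the second.

```python
def adjust_queues(current, load, first_threshold, second_threshold,
                  min_queues=1, max_queues=32):
    """Return the new queue count for one load sample."""
    if first_threshold <= second_threshold:
        raise ValueError("the first threshold must be greater than the second")
    if load > first_threshold:
        return max(min_queues, current - 1)  # overloaded: reduce queues
    if load < second_threshold:
        return min(max_queues, current + 1)  # under-used: add queues
    return current                           # between thresholds: no change


queues = 10
history = []
for load in [0.95, 0.92, 0.60, 0.20, 0.10]:  # hypothetical load samples
    queues = adjust_queues(queues, load, first_threshold=0.85, second_threshold=0.30)
    history.append(queues)
print(history)
```

Keeping a gap between the two thresholds means that a load value hovering around a single cut-off does not cause queues to be added and removed on every sample.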
Optionally, the method further comprises:
acquiring the resource condition of the external distributed storage system, and calculating the current resource load value of the external distributed storage system;
if the resource load value of the external distributed storage system is greater than a third threshold, reducing the data mirroring storage queues;
if the resource load value of the external distributed storage system is smaller than a fourth threshold, adding data mirroring storage queues;
wherein the third threshold is greater than the fourth threshold.
In a second aspect, an embodiment of the present invention provides a real-time big data mirroring storage device, including:
a data receiving module, configured to receive a real-time data source;
a data mirror image processing module, configured to perform row-column splitting on original data in the real-time data source to obtain the number of original data records of the original data, to perform mirror image processing on the original data according to a preset mirror image algorithm to obtain a mirrored data result, to store the data result into a cache variable, and to record the number of cache data records in the cache variable;
a data checking module, configured to judge, if the size of the cache variable reaches a set value, whether the error between the number of original data records and the number of cache data records is smaller than a preset threshold;
a data mirror image storage module, configured to store, if the error is smaller than the preset threshold, the cache data in the cache variable into a storage file according to a specified configuration path, and to record the number of stored data records in the storage file;
wherein the data checking module is further configured to judge whether the error between the number of cache data records and the number of stored data records is smaller than a preset threshold; and if the error is smaller than the preset threshold, the storage file is sent to an external distributed storage system for storage.
Optionally, the data mirroring processing module is further configured to:
loading a data mirror configuration table;
and mirroring the column data of each row in the original data according to the column data mirror mapping relation configured in the configuration table to obtain a mirrored data result.
Optionally, the apparatus further comprises a computing resource monitoring module configured to:
acquiring the resource condition of the local system, and calculating the current resource load value of the local system;
if the resource load value of the local system is greater than a first threshold, reducing the data mirroring processing queues;
if the resource load value of the local system is smaller than a second threshold, adding data mirroring processing queues;
wherein the first threshold is greater than the second threshold.
Optionally, the apparatus further comprises a computing resource monitoring module configured to:
acquiring the resource condition of the external distributed storage system, and calculating the current resource load value of the external distributed storage system;
if the resource load value of the external distributed storage system is greater than a third threshold, reducing the data mirroring storage queues;
if the resource load value of the external distributed storage system is smaller than a fourth threshold, adding data mirroring storage queues;
wherein the third threshold is greater than the fourth threshold.
In a third aspect, a further embodiment of the present invention provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method according to the first aspect when executing the program.
In a fourth aspect, a further embodiment of the invention provides a computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the method according to the first aspect.
The embodiments of the invention provide a real-time big data mirror image storage method and device. The method splits the original data acquired from a real-time data source, applies mirror image processing (such as rearranging, screening and deleting) to the split data according to actual service requirements, and finally stores the mirrored data, thereby reducing data redundancy and improving data availability. In addition, the method checks both the cache data produced by mirror image processing and the storage data finally written to the specified path, and completes storage only when the error between the record counts is smaller than a preset threshold. This improves the accuracy of the stored data and provides strong support for later data analysis.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flowchart of a real-time big data mirror storage method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for storing a real-time big data mirror image according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of splitting and mirroring the original data according to an embodiment of the present invention;
FIG. 4 is a flowchart of a method for monitoring a local system and an external distributed storage system according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an embodiment of a real-time big data mirroring storage device according to the present invention;
FIG. 6 is a schematic structural diagram of an embodiment of a real-time big data mirroring storage device according to the present invention;
FIG. 7 is a block diagram of an embodiment of a computer device provided in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In a first aspect, an embodiment of the present invention provides a real-time big data mirror storage method, as shown in fig. 1, including:
S101, receiving a real-time data source;
S102, performing row-column splitting on original data in the real-time data source to obtain the number of original data records of the original data; performing mirror image processing on the original data according to a preset mirror image algorithm to obtain a mirrored data result, storing the data result into a cache variable, and recording the number of cache data records in the cache variable;
S103, if the size of the cache variable reaches a set value, judging whether the error between the number of original data records and the number of cache data records is smaller than a preset threshold;
S104, if the error is smaller than the preset threshold, storing the cache data in the cache variable into a storage file according to a specified configuration path, and recording the number of stored data records in the storage file;
S105, judging whether the error between the number of cache data records and the number of stored data records is smaller than a preset threshold; and if the error is smaller than the preset threshold, sending the storage file to an external distributed storage system for storage.
The embodiment of the invention provides a real-time big data mirror image storage method, which splits the original data acquired from a real-time data source, applies mirror image processing (such as rearranging, screening and deleting) to the split data according to actual service requirements, and finally stores the mirrored data, thereby reducing data redundancy and improving data availability. In addition, the method checks both the cache data produced by mirror image processing and the storage data finally written to the specified path, and completes storage only when the error between the record counts is smaller than a preset threshold. This improves the accuracy of the stored data and provides strong support for later data analysis.
To facilitate an understanding of the method provided by the above examples, an alternative implementation of the various steps in the method is described in detail below with reference to fig. 2.
S101: receiving a real-time data source.
Specifically, this step may include:
(1) starting N (for example, 10) data receiving thread queues according to the system configuration data;
(2) each thread connects to and authenticates with the data source server according to the data source configuration, so that it can subsequently adapt to the external real-time data it receives, such as Kafka (open-source real-time data transmission software) interface data, FTP interface data or file data; other interface data sources can also be added;
(3) each thread queue receives an external real-time data source;
(4) marking checkpoint 1 for data verification, with checkpoint 1 held in the variable CHECKPOINTDATA1.
The real-time data source here may be operator network signaling, application system logs of an internet company, and the like. The data content is mainly text organized as line data, and the fields of each line are separated by the same delimiter, for example:
Line 1: A1,A2,A3,A4,A5\r\n
Line 2: B1,B2,B3,B4,B5\r\n
……
Here "\r\n" is a user-definable row delimiter and "," is a user-definable field delimiter within a row.
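The receiving step S101 can be sketched as below. This is a hedged Python illustration only: in-memory lists stand in for the Kafka, FTP or file interfaces, the authentication of step (2) is omitted, and the names `receive` and `run_receivers` are hypothetical. One thread is started per configured source and all threads feed a shared queue.

```python
import queue
import threading


def receive(source, out_queue, counter, lock):
    """Drain one data source into the shared queue, counting payloads."""
    for payload in source:
        out_queue.put(payload)
        with lock:  # the counter is shared by all receiving threads
            counter["payloads"] += 1


def run_receivers(sources, out_queue):
    """Start one receiving thread per source and wait for all of them to finish."""
    counter = {"payloads": 0}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=receive, args=(src, out_queue, counter, lock))
        for src in sources
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["payloads"]


# Two stand-in sources; real sources would be network or file interfaces.
sources = [["A1,A2,A3\r\nB1,B2,B3\r\n"], ["C1,C2,C3\r\n"]]
inbox = queue.Queue()
received = run_receivers(sources, inbox)
print(received, inbox.qsize())
```

A real receiver would run continuously rather than draining a finite list; the finite sources here only keep the sketch self-terminating.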
S102: performing row-column splitting on the original data in the real-time data source to obtain the number of original data records of the original data; performing mirror image processing on the original data according to a preset mirror image algorithm to obtain a mirrored data result, storing the data result into a cache variable, and recording the number of cache data records in the cache variable.
Specifically, the method may include:
(1) starting N (for example, 10) data mirroring thread queues according to the system configuration data;
(2) referring to fig. 3, after receiving data, each thread queue first splits the data, first by rows and then by columns.
Row splitting divides the data at the row delimiters and stores the rows in a row data array variable RowData[n]. The number of data rows, which is the number of original data records, is recorded and accumulated in CHECKPOINTDATA1. For the example shown in fig. 3, CHECKPOINTDATA1 is 3.
Column splitting first reads the row data array RowData[n], takes one group of data (i.e. one row) at a time, and splits it at the configured column delimiter. Taking the first group RowData[0] as an example, its data is: A1,A2,A3,A4,A5,A6,A7,A8,A9,A10. Here the delimiter is ",", and column splitting yields 10 data items (A1 to A10). An array variable COLDATA[] of 10 elements is set, and the data items are stored in it in order: COLDATA[0] = 'A1', COLDATA[1] = 'A2', COLDATA[2] = 'A3', …, COLDATA[9] = 'A10';
(3) loading the data mirror configuration table, which can be set according to actual conditions. For example, the configuration format is: data interface name: column 1, column 2, column 3, column 4, column 5, column 6, column 7, column 8, column 9. The arrangement of the columns can be set according to service requirements, for example, IN1:0,2,1,4,3,5,7,8,9, in which IN1 is the name of the data interface and 0,2,1,4,3,5,7,8,9 is the column data mirror mapping relationship. The column order of the original data after the row-column splitting in step (2) is 0,1,2,3,4,5,6,7,8,9. Through the mirror configuration table, the column data of the original data can be rearranged and screened according to service requirements, and useless data can be removed. Whether data is useless is determined by the actual service: for example, some fields or some information may be of no use to a given service, and such data can be removed in the mirroring step.
The mirror mapping data is then stored in an array variable, namely MIRRORTABLE[] = [0,2,1,4,3,5,7,8,9];
(4) as shown in fig. 3, data mirroring is performed according to the column mirroring relationship to obtain a new mirroring result. Specifically, the mirror mapping array variable MIRRORTABLE[] and the column array COLDATA[] are obtained first, and a mirror data storage variable MIRRORDATA is set. The elements of MIRRORTABLE[] are then read in order and used as indexes into COLDATA[], and the selected data items are concatenated and stored in MIRRORDATA, that is:
MIRRORDATA = COLDATA[MIRRORTABLE[0]] + ","
+ COLDATA[MIRRORTABLE[1]] + ","
+ ……
+ COLDATA[MIRRORTABLE[8]]
The final mirroring result is:
MIRRORDATA = "A1,A3,A2,A5,A4,A6,A8,A9,A10", where the data item "A7" has been removed according to the mirroring relationship;
(5) once all data rows have been mirrored according to steps (3) and (4), they form a new data result MIRRORDATAS;
(6) judging whether a cache variable for the data of this data interface currently exists and, if not, creating one; accumulating the mirrored data MIRRORDATAS into the cache variable, and counting the number of cache data records in the cache variable. The checkpoint 2 variable for data verification is set to CHECKPOINTDATA2, and the number of cache data records is stored in CHECKPOINTDATA2. For the example shown in fig. 3, CHECKPOINTDATA2 is 3.
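Steps (2) to (6) can be sketched end to end as follows. This is a minimal Python illustration assuming "\r\n" and "," as the delimiters and a three-row payload like the one in fig. 3; the function names, the list used as the cache variable, and the choice to skip rows with too few columns are assumptions rather than details given in the text.

```python
ROW_SEP = "\r\n"  # row delimiter (configurable in the described method)
COL_SEP = ","     # column delimiter (configurable in the described method)


def split_rows(payload):
    """Row splitting: return the non-empty rows of a payload (RowData)."""
    return [row for row in payload.split(ROW_SEP) if row]


def mirror_payload(payload, mirror_table, cache, checkpoints):
    """Split, mirror and cache one payload while updating both checkpoints."""
    row_data = split_rows(payload)
    checkpoints["CHECKPOINTDATA1"] += len(row_data)  # original record count
    for row in row_data:
        coldata = row.split(COL_SEP)                 # column splitting (COLDATA)
        if max(mirror_table) >= len(coldata):
            continue  # assumption: rows with too few columns are skipped
        cache.append(COL_SEP.join(coldata[i] for i in mirror_table))
    checkpoints["CHECKPOINTDATA2"] = len(cache)      # cached record count


# Build three rows A1..A10, B1..B10, C1..C10.
payload = ROW_SEP.join(
    COL_SEP.join(f"{prefix}{n}" for n in range(1, 11)) for prefix in "ABC"
) + ROW_SEP
mirror_table = [0, 2, 1, 4, 3, 5, 7, 8, 9]
cache = []
checkpoints = {"CHECKPOINTDATA1": 0, "CHECKPOINTDATA2": 0}
mirror_payload(payload, mirror_table, cache, checkpoints)
print(cache[0])
print(checkpoints)
```

Because skipped rows are counted in CHECKPOINTDATA1 but not in CHECKPOINTDATA2, any loss during mirroring shows up as a difference between the two checkpoints.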
S103: if the size of the cache variable reaches a set value, judging whether the error between the number of original data records and the number of cache data records is smaller than a preset threshold.
The specific checking steps are as follows. A data checking thread is started when the system starts and waits to perform data checks. When the size of the cache variable is determined to have reached the set value, CHECKPOINTDATA1 (the number of original data records) and CHECKPOINTDATA2 (the number of cache data records) of the current thread are obtained and compared. If the error between the two record counts is smaller than or equal to the preset threshold, the check succeeds: little data has been lost, the data is currently accurate, and processing can proceed to the next step. If the error is larger than the preset threshold, the check fails: more data has been lost and the data is currently less accurate. In that case, early warning information can be issued to notify staff of the data loss, and the configuration determines whether the mirror image processing of the original data is executed again and the result recorded again until the check succeeds.
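The comparison of the two checkpoints can be sketched as below. The text does not define how the "error" is computed, so the relative difference used here is an assumption, as are the function names and the sample threshold of 0.5%.

```python
def record_error(expected, actual):
    """Relative difference between two record counts (this definition is an assumption)."""
    if expected == 0:
        return 0.0 if actual == 0 else 1.0
    return abs(expected - actual) / expected


def check_passes(expected, actual, threshold):
    """The check succeeds when the error is smaller than or equal to the threshold."""
    return record_error(expected, actual) <= threshold


# CHECKPOINTDATA1 = 1000 original records, threshold of 0.5 % (hypothetical values).
print(check_passes(1000, 998, 0.005))  # 2 records lost
print(check_passes(1000, 900, 0.005))  # 100 records lost
```

An absolute difference would work equally well with this structure; only `record_error` would change.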
S104: if the error is smaller than the preset threshold, storing the cache data in the cache variable into a storage file according to a specified configuration path, and recording the number of stored data records in the storage file.
Specifically, this step includes:
(1) starting N (for example, 10) data storage thread queues according to the system configuration data;
(2) the thread queue acquires the cache file that passed the check in S103 and needs to be transmitted, namely the data in the cache variable;
(3) the queue thread stores the cache data at the configured path through a storage interface of the distributed storage system (for example, an HDFS system);
(4) the queue thread obtains the storage result information, including distributed storage information such as the storage file blocks, path and size, and counts the number of stored data records in the storage file. The checkpoint 3 variable for data verification is set to CHECKPOINTDATA3, and the number of stored data records is stored in CHECKPOINTDATA3.
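The storage step can be sketched as follows, with a local file standing in for the storage interface of the distributed storage system. The function name `store_cache`, the file name and the "\r\n" record terminator are assumptions; the point illustrated is that the stored record count is obtained by reading the written file back rather than by reusing the cache count.

```python
import os
import tempfile


def store_cache(cache, path):
    """Write cached records to `path` and report what was actually stored."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, "w", encoding="utf-8", newline="") as fh:
        for record in cache:
            fh.write(record + "\r\n")
    # Count records by re-reading the file, independently of len(cache).
    with open(path, "r", encoding="utf-8", newline="") as fh:
        stored = sum(1 for line in fh if line.strip())
    return {"path": path, "size": os.path.getsize(path), "CHECKPOINTDATA3": stored}


with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "IN1", "part-0001.txt")  # hypothetical configured path
    info = store_cache(["A1,A3,A2", "B1,B3,B2"], target)
    print(info["CHECKPOINTDATA3"], info["size"])
```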
S105: judging whether the error between the number of cache data records and the number of stored data records is smaller than a preset threshold; and if the error is smaller than the preset threshold, sending the storage file to an external distributed storage system for storage.
The data checking thread queue started at system startup can now check the stored data records. The specific checking steps are as follows. The queue obtains and compares CHECKPOINTDATA2 (the number of cache data records) and CHECKPOINTDATA3 (the number of stored data records) of the current thread. If the error between the two record counts is smaller than or equal to the preset threshold, the verification succeeds: little data has been lost, the data is currently accurate, and storage for this thread is complete. If the error is larger than the preset threshold, the verification fails: more data has been lost and the data is currently less accurate. In that case, early warning information can be issued to notify staff of the data loss, and the configuration determines whether distributed storage of the cache variable data is executed again and the result recorded again until the verification succeeds. After the verification succeeds, the storage file can be sent to the external distributed storage system for storage.
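The store, verify and re-execute loop can be sketched as follows. This is a hedged Python illustration: the retry limit, the relative error definition, the `alert` callback and the deliberately faulty `flaky_store` stand-in are all assumptions used to exercise the retry path, not details from the text.

```python
def store_until_verified(cache, store, threshold, max_attempts=3, alert=print):
    """Store the cache, compare CHECKPOINTDATA2 with CHECKPOINTDATA3, retry on failure."""
    expected = len(cache)                  # CHECKPOINTDATA2
    for attempt in range(1, max_attempts + 1):
        stored = store(cache)              # returns CHECKPOINTDATA3
        error = abs(expected - stored) / expected if expected else 0.0
        if error <= threshold:
            return attempt                 # verification succeeded
        alert(f"attempt {attempt}: stored {stored} of {expected} records")
    raise RuntimeError("stored data failed verification")


calls = []


def flaky_store(cache):
    """Stand-in storage call that loses one record on its first invocation."""
    calls.append(len(cache))
    return len(cache) - 1 if len(calls) == 1 else len(cache)


alerts = []
attempts = store_until_verified(
    ["r1", "r2", "r3", "r4"], flaky_store, threshold=0.0, alert=alerts.append
)
print(attempts, alerts)
```

Bounding the number of attempts is a design choice of this sketch; the text leaves the re-execution policy to configuration.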
Through the above steps, the embodiment of the invention performs mirror image processing on the original data, which removes redundant data and reduces storage pressure. Meanwhile, the cached data is checked against the original data and the finally stored data is checked against the cached data, and processing continues only when the error is smaller than the preset threshold. Data quality is thus controlled at several links, heavy loss of stored data is avoided, and the accuracy of the data is effectively improved.
Besides large data redundancy and a high likelihood of data loss, existing distributed storage methods also suffer from low processing efficiency, because the computing resources of the server cannot be used reasonably. Specifically, utilization is high when the server's computing load is heavy and low when it is light, which easily delays the arrival of data.
Based on this, the method provided by the embodiment of the present invention may further include:
(1) computational monitoring and regulation of the resource condition of the local system, which, as shown in fig. 4, specifically includes:
S1: acquiring the resource condition of the local system, and calculating the current resource load value of the local system.
Specifically, a resource monitoring thread is started when the system starts. Thread 1 acquires the resource condition of the local server once per second and calculates the current resource load value SysLoad;
S2: if the resource load value SysLoad of the local system is greater than a first threshold X1, reducing the data mirroring processing queues;
S3: if the resource load value SysLoad of the local system is smaller than a second threshold X2, adding data mirroring processing queues;
wherein the first threshold X1 is greater than the second threshold X2. The local system specifically refers to the server that performs receiving, mirroring, storage and data verification of the original data.
(2) Monitoring and regulating the computing resources of the external distributed storage system, which, as shown in Fig. 4, may specifically include:
S1', acquiring the resource condition of the external distributed storage system, and calculating the current resource load value of the external distributed storage system;
Specifically, a resource monitoring thread is started when the system starts. Thread 2 acquires the resource condition of the external distributed storage system once per second and calculates the current resource load value DFSLoad;
S2', if the resource load value DFSLoad of the external distributed storage system is greater than a third threshold Y1, reducing the data mirroring storage queues;
S3', if the resource load value DFSLoad of the external distributed storage system is smaller than a fourth threshold Y2, adding a data mirroring storage queue;
wherein the third threshold Y1 is greater than the fourth threshold Y2. The external distributed storage system is a file system outside the server that divides data files into blocks and stores them in a distributed manner. It is typically used for big data storage and parallel computing and is highly available.
It should be noted that Fig. 4 only shows the case in which both the local system and the external distributed storage system are monitored and regulated. In practical applications, only the local system may be monitored, only the external distributed storage system may be monitored, or both may be monitored simultaneously.
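The once-per-second monitoring threads described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: `sample_load` stands for a hypothetical probe that returns the current load value (SysLoad or DFSLoad), and `on_load` stands for a hypothetical callback that applies the threshold rule. Neither name comes from the embodiment.

```python
import threading


def start_monitor(sample_load, on_load, interval_s=1.0):
    """Start a daemon thread that samples the load every interval_s seconds.

    sample_load -- hypothetical callable returning the current load value
    on_load     -- hypothetical callable that receives each sampled value
    Returns (stop_event, thread); call stop_event.set() to end the loop.
    """
    stop_event = threading.Event()

    def loop():
        # Event.wait returns False on timeout and True once the event is set,
        # so the loop samples once per interval until it is asked to stop.
        while not stop_event.wait(interval_s):
            on_load(sample_load())

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return stop_event, thread
```

Starting one monitor with a local probe, one with a distributed-storage probe, or both corresponds to the three monitoring options noted above.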
In a second aspect, an embodiment of the present invention provides a real-time big data mirroring storage device, as shown in fig. 5, including:
a data receiving module 201, configured to receive a real-time data source;
the data mirror image processing module 202 is configured to perform row-column splitting on the original data in the real-time data source to obtain an original data record number of the original data; carrying out mirror image processing on the original data according to a preset mirror image algorithm to obtain a data result after mirror image processing, storing the data result into a cache variable, and recording the number of cache data records in the cache variable;
the data checking module 203 is configured to, if the size of the cache variable reaches a set value, determine whether the error between the original data record number and the cache data record number is smaller than a preset threshold;
the data mirror image storage module 204 is configured to, if the error is determined to be smaller than the preset threshold, store the data in the cache variable into a storage file according to a specified configuration path, and record the number of stored data records in the storage file;
the data checking module 203 is further configured to determine whether the error between the cache data record number and the stored data record number is smaller than a preset threshold and, if so, send the storage file to an external distributed storage system for storage.
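The two record-count checks performed by modules 203 and 204 can be sketched in Python as follows. The sketch rests on several assumptions that the text does not state: the "error" is taken to be the absolute difference between two record counts, each record is a single-line string, and the storage file holds one record per line. The function names and the `max_error` parameter are illustrative.

```python
def within_tolerance(expected, actual, max_error):
    """Assumed error measure: absolute difference between two record counts."""
    return abs(expected - actual) < max_error


def verify_and_store(cache, original_count, path, max_error=1):
    """Check the cache against the original count, write it, then re-check.

    Returns True only if both checks pass, meaning the file at `path`
    would be ready to send to the external distributed storage system.
    """
    # First check: original record count versus cached record count.
    if not within_tolerance(original_count, len(cache), max_error):
        return False
    # Store the cached records, one per line, at the configured path.
    with open(path, "w", encoding="utf-8") as f:
        for record in cache:
            f.write(record + "\n")
    # Second check: cached record count versus records actually stored.
    with open(path, encoding="utf-8") as f:
        stored_count = sum(1 for _ in f)
    return within_tolerance(len(cache), stored_count, max_error)
```

With the default `max_error=1`, a check passes only when the two counts are equal. A larger value tolerates a small number of lost records.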
Optionally, the data mirroring processing module is further configured to:
loading a data mirror configuration table;
and mirroring the row and column data of each row in the original data according to the row data mirroring mapping relation configured in the configuration table to obtain a mirrored data result.
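As a rough illustration of configuration-driven mirroring, the Python sketch below maps field values through a lookup table. The layout of the configuration table is an assumption: here it is a dictionary keyed by column index, each entry mapping an original value to its mirrored value, and rows are assumed to be delimiter-separated text. The function name, the delimiter, and the sample values are hypothetical and are not taken from the embodiment.

```python
def mirror_row(line, config, delimiter="|"):
    """Mirror one row of original data using an assumed configuration table.

    config -- hypothetical table: {column_index: {original_value: mirrored_value}}
    Values with no configured mapping are passed through unchanged.
    """
    fields = line.split(delimiter)
    mirrored = [config.get(i, {}).get(value, value) for i, value in enumerate(fields)]
    return delimiter.join(mirrored)


# Hypothetical configuration: replace a long value in column 1 with a short code.
example_config = {1: {"Guangdong": "GD"}}
example_result = mirror_row("1001|Guangdong|4G", example_config)
```

In this sketch, replacing repeated long values with short codes is one way mirroring could reduce stored volume. The actual mapping relation is whatever the configuration table defines.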
Optionally, the apparatus further comprises a computing resource monitoring module 205 configured to:
acquiring the resource condition of a local system, and calculating the current resource load value of the local system;
if the resource load value of the local system is greater than a first threshold value, reducing a data mirroring processing queue;
if the resource load value of the local system is smaller than a second threshold value, adding a data mirror processing queue;
wherein the first threshold is greater than the second threshold.
Optionally, the apparatus further comprises a computing resource monitoring module 205 configured to:
acquiring the resource condition of the external distributed storage system, and calculating the current resource load value of the external distributed storage system;
if the resource load value of the external distributed storage system is larger than a third threshold value, reducing a data mirror image storage queue;
if the resource load value of the external distributed storage system is smaller than a fourth threshold value, adding a data mirror image storage queue;
wherein the third threshold is greater than the fourth threshold.
Fig. 6 shows a schematic structural diagram of a real-time big data mirroring storage device according to an embodiment of the present invention.
The real-time big data mirror storage device described in this embodiment is a device that can execute the real-time big data mirror storage method of the embodiment of the present invention. Based on that method, a person skilled in the art can understand the specific implementation of the device and its variations, so how the device implements the method is not described in detail here. Any device that a person skilled in the art uses to implement the real-time big data mirror storage method of the embodiment of the present invention falls within the scope of protection of the present application.
Fig. 7 shows a block diagram of a computer device according to an embodiment of the present invention.
Referring to Fig. 7, the computer device includes: a processor 301, a memory 302, a bus 303, and a bus interface 304;
the processor 301 and the memory 302 communicate with each other through the bus 303, and the bus interface 304 is used for interacting with external devices.
The processor 301 is configured to call program instructions in the memory 302 to perform the methods provided by the above-described method embodiments.
Embodiments of the present invention also disclose a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions, which when executed by a computer, enable the computer to perform the methods provided by the above-mentioned method embodiments.
Embodiments of the present invention also provide a non-transitory computer-readable storage medium, which stores computer instructions, and the computer instructions cause the computer to execute the methods provided by the above method embodiments.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features included in other embodiments but not other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Some component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components of a gateway, proxy server, or system according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering. These words may be interpreted as names.

Claims (8)

1. A real-time big data mirror image storage method is characterized by comprising the following steps:
receiving a real-time data source;
performing row-column splitting on original data in the real-time data source to obtain the original data record number of the original data; carrying out mirror image processing on the original data according to a preset mirror image algorithm to obtain a data result after mirror image processing, storing the data result into a cache variable, and recording the number of cache data records in the cache variable;
if the size of the cache variable reaches a set value, judging whether the error between the original data record number and the cache data record number is smaller than a preset threshold value;
if the error is smaller than the preset threshold, storing the cache data in the cache variable into a storage file according to a specified configuration path, and recording the number of the stored data records in the storage file;
judging whether the error between the number of the cached data records and the number of the stored data records is smaller than a preset threshold value; if so, sending the storage file to an external distributed storage system for storage;
wherein performing mirror image processing on the original data according to the preset mirror image algorithm to obtain the data result after mirror image processing comprises:
loading a data mirror configuration table;
and mirroring the row and column data of each row in the original data according to the row data mirroring mapping relation configured in the configuration table to obtain a mirrored data result.
2. The method of claim 1, further comprising:
acquiring the resource condition of a local system, and calculating the current resource load value of the local system;
if the resource load value of the local system is greater than a first threshold value, reducing a data mirroring processing queue;
if the resource load value of the local system is smaller than a second threshold value, adding a data mirror processing queue;
wherein the first threshold is greater than the second threshold.
3. The method of claim 1, further comprising:
acquiring the resource condition of the external distributed storage system, and calculating the current resource load value of the external distributed storage system;
if the resource load value of the external distributed storage system is larger than a third threshold value, reducing a data mirror image storage queue;
if the resource load value of the external distributed storage system is smaller than a fourth threshold value, adding a data mirror image storage queue;
wherein the third threshold is greater than the fourth threshold.
4. A real-time big data mirror storage device, comprising:
the data receiving module is used for receiving a real-time data source;
the data mirror image processing module is used for splitting rows and columns of original data in the real-time data source to obtain the number of original data records of the original data; carrying out mirror image processing on the original data according to a preset mirror image algorithm to obtain a data result after mirror image processing, storing the data result into a cache variable, and recording the number of cache data records in the cache variable;
the data checking module is used for judging whether the error between the original data record number and the cache data record number is smaller than a preset threshold value or not if the size of the cache variable reaches a set value;
the data mirror image storage module is used for storing the data in the cache variable into a storage file according to a specified configuration path and recording the number of stored data records in the storage file if the error is judged to be smaller than the preset threshold value;
the data checking module is further configured to determine whether an error between the number of cached data records and the number of stored data records is smaller than a preset threshold and, if so, to send the storage file to an external distributed storage system for storage;
wherein the data mirror processing module is further configured to:
loading a data mirror configuration table;
and mirroring the row and column data of each row in the original data according to the row data mirroring mapping relation configured in the configuration table to obtain a mirrored data result.
5. The apparatus of claim 4, further comprising a computing resource monitoring module to:
acquiring the resource condition of a local system, and calculating the current resource load value of the local system;
if the resource load value of the local system is greater than a first threshold value, reducing a data mirroring processing queue;
if the resource load value of the local system is smaller than a second threshold value, adding a data mirror processing queue;
wherein the first threshold is greater than the second threshold.
6. The apparatus of claim 4, further comprising a computing resource monitoring module to:
acquiring the resource condition of the external distributed storage system, and calculating the current resource load value of the external distributed storage system;
if the resource load value of the external distributed storage system is larger than a third threshold value, reducing a data mirror image storage queue;
if the resource load value of the external distributed storage system is smaller than a fourth threshold value, adding a data mirror image storage queue;
wherein the third threshold is greater than the fourth threshold.
7. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1-3 are implemented when the program is executed by the processor.
8. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 3.
CN201710771908.6A 2017-08-31 2017-08-31 Real-time big data mirror image storage method and device Active CN109426438B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710771908.6A CN109426438B (en) 2017-08-31 2017-08-31 Real-time big data mirror image storage method and device

Publications (2)

Publication Number Publication Date
CN109426438A CN109426438A (en) 2019-03-05
CN109426438B true CN109426438B (en) 2021-09-21

Family

ID=65505284

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710771908.6A Active CN109426438B (en) 2017-08-31 2017-08-31 Real-time big data mirror image storage method and device

Country Status (1)

Country Link
CN (1) CN109426438B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110765479B (en) * 2019-11-03 2020-04-24 长沙豆芽文化科技有限公司 Big data loss prevention method, device and equipment
CN111522797B (en) * 2020-04-27 2023-06-02 支付宝(杭州)信息技术有限公司 Method and device for constructing business model based on business database
CN113806323A (en) * 2020-06-11 2021-12-17 中移(苏州)软件技术有限公司 Data processing method and device, electronic equipment and computer storage medium
CN114371810B (en) * 2020-10-15 2023-10-27 中国移动通信集团设计院有限公司 Data storage method and device of HDFS

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009223355A (en) * 2008-03-13 2009-10-01 Hitachi Software Eng Co Ltd Disk control system for performing mirroring of hard disk and silicon disk
CN104765575A (en) * 2015-04-23 2015-07-08 成都博元时代软件有限公司 Information storage processing method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8402213B2 (en) * 2008-12-30 2013-03-19 Lsi Corporation Data redundancy using two distributed mirror sets

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
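Research and Implementation of Virtual Machine Image Storage Optimization Based on Data Deduplication; Li Zhangjuan (李张娟); Modern Computer (《现代计算机》); 20160131; pp. 29-31 *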

Also Published As

Publication number Publication date
CN109426438A (en) 2019-03-05

Similar Documents

Publication Publication Date Title
CN109426438B (en) Real-time big data mirror image storage method and device
WO2017096968A1 (en) Log uploading method and apparatus
CN107341258B (en) Log data acquisition method and system
CN108521339B (en) Feedback type node fault processing method and system based on cluster log
CN106815254B (en) Data processing method and device
CN105607986A (en) Acquisition method and device of user behavior log data
CN105955807B (en) Task processing system and method
CN109299052B (en) Log cutting method, device, computer equipment and storage medium
CN112506619B (en) Job processing method, job processing device, electronic equipment and storage medium
CN105183585B (en) Data backup method and device
CN111682981A (en) Check point interval setting method and device based on cloud platform performance
CN111177193A (en) Flink-based log streaming processing method and system
CN112256551A (en) Remote log capturing method and device, electronic equipment and storage medium
CN110334011B (en) Method and device for executing test case
CN104376088A (en) Distributed synchronization method of cloud database and database system
CN108573172B (en) Data checking and storing method and device
CN110442439B (en) Task process processing method and device and computer equipment
CN109284257B (en) Log writing method and device, electronic equipment and storage medium
CN111259081A (en) Data synchronization method and device, electronic equipment and storage medium
CN116414914A (en) Data synchronization method and device, processor and electronic equipment
CN110597794A (en) Data processing method and device and electronic equipment
EP3099012A1 (en) A method for determining a topology of a computer cloud at an event date
CN115664992A (en) Network operation data processing method and device, electronic equipment and medium
KR20160145250A (en) Shuffle Embedded Distributed Storage System Supporting Virtual Merge and Method Thereof
CN115576782A (en) Transaction processing method and device based on monitoring mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant