CN109542900B - Data processing method and device

Info

Publication number: CN109542900B
Application number: CN201811324792.2A
Authority: CN
Inventors: 姚海波; 郭仁康; 常志娟
Original assignee: China Unionpay Co Ltd
Current assignee: China Unionpay Co Ltd
Priority date: 2018-11-08
Filing date: 2018-11-08
Publication date: 2021-07-13
Anticipated expiration: 2038-11-08
Also published as: CN109542900A

Abstract

The invention discloses a data processing method and device. The data processing method comprises the following steps: a first host reads at least one data record from at least one storage device; the first host processes the at least one data record to obtain K fragment files; and the first host provides a first fragment file of the K fragment files to a first storage device of the at least one storage device.

Description

Data processing method and device

Technical Field

The present invention relates to the field of databases, and in particular, to a data processing method and apparatus.

Background

Data records accumulate in a business system and are ultimately merged into a final file, such as a reconciliation file. Intermediate files, called fragment files, are generated during the merging process. Each data record has a unique classification number, such as the organization number of the data record, and data records with the same organization number can be classified into one type.

In the prior art, when a host processes the data records in a business system, it generates a plurality of fragment files from the data records corresponding to one classification number and merges them into a final file, but this approach can process the data of only one classification number.

Therefore, the inability of a prior-art host to process the data corresponding to a plurality of classification numbers is a problem that urgently needs to be solved.

Disclosure of Invention

The embodiment of the application provides a data processing method and device, and solves the problem that a host cannot process corresponding data of a plurality of classification numbers in the prior art.

The embodiment of the invention provides a data processing method, which comprises the following steps:

the first storage device reads N data records; in the N data records, each data record comprises a classification number of the data record, and the N data records comprise K different classification numbers; K and N are integers greater than 0;

the first storage device provides the N data records to S hosts; S is an integer greater than 0;

the first storage device acquires a first fragmented file returned by a first host of the S hosts; the first fragment file is uniquely corresponding to a first classification number, and the first fragment file comprises at least one data record corresponding to the first classification number;

the first storage device writes the first fragment file into a first merged file uniquely bound with the first classification number; the first merged file is a file storing the data record of the first classification number.

Optionally, each data record includes a data number;

the first storage device providing the N data records to S hosts, comprising:

the first storage device determines P sets, wherein the sets are numbered 0 to P-1, and each set corresponds to one host of the S hosts; P is an integer greater than 0;

and the first storage device performs a remainder operation on the data number of each of the N data records with respect to P to obtain a remainder result for each data record, and stores each data record into the set, among the P sets, whose number equals the remainder result of that data record.
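The remainder-based assignment described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `partition_by_remainder` and the record field names `data_number` and `classification` are assumptions made for the example.

```python
def partition_by_remainder(records, p):
    """Place each record into the set numbered (data_number mod p)."""
    if p <= 0:
        raise ValueError("p must be an integer greater than 0")
    # Sets are numbered 0 to p-1; each set would be served by one host.
    sets = {number: [] for number in range(p)}
    for record in records:
        sets[record["data_number"] % p].append(record)
    return sets


# Hypothetical sample records (field names are illustrative only).
records = [
    {"data_number": 10, "classification": "ORG1"},
    {"data_number": 11, "classification": "ORG2"},
    {"data_number": 13, "classification": "ORG1"},
]
sets = partition_by_remainder(records, 3)
```

With P = 3, data numbers 10 and 13 both leave remainder 1 and land in set 1, while data number 11 lands in set 2.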

Optionally, in addition to acquiring the first fragment file returned by the first host of the S hosts, the method further includes:

the first storage device acquires a second fragment file returned by a second host; the second fragment file comprises at least one data record corresponding to the first classification number.

the first host reads at least one data record from at least one storage device; each data record in the at least one data record comprises the classification number of the data record;

the first host processes the at least one data record to obtain K fragment files; each fragment file of the K fragment files uniquely corresponds to one classification number and comprises at least one data record corresponding to the classification number of the fragment file; K is an integer greater than 0;

the first host provides a first fragment file of the K fragment files to a first storage device of the at least one storage device; the first classification number corresponding to the first fragment file is the same as one of the at least one classification number bound to the first storage device; the first fragment file is an intermediate file for merging the data records corresponding to the first classification number; the first storage device is used for storing the merged file of the data records corresponding to the first classification number.

Optionally, the method includes:

the first storage device is a storage device corresponding to the hash value of the first classification number in the at least one storage device.
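As a rough sketch of choosing a storage device from the hash value of a classification number: the text does not specify the hash function at this point, so the example substitutes CRC-32 from Python's standard `zlib` module purely as a stable stand-in. The function name `storage_device_for` and the device names are hypothetical.

```python
import zlib


def storage_device_for(classification_number, devices):
    """Map a classification number to one device via a stable hash (CRC-32 stand-in)."""
    if not devices:
        raise ValueError("at least one storage device is required")
    digest = zlib.crc32(classification_number.encode("utf-8"))
    # The same classification number always maps to the same device.
    return devices[digest % len(devices)]


devices = ["storage-0", "storage-1", "storage-2"]  # hypothetical device names
first = storage_device_for("ORG0001", devices)
```

Because the mapping depends only on the classification number, every host that produces a fragment file for that number sends it to the same storage device.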

An embodiment of the present invention provides a data processing apparatus, including:

the reading module is used for reading the N data records; in the N data records, each data record comprises a classification number of the data record, and the N data records comprise K different classification numbers; K and N are integers greater than 0;

a processing module for providing the N data records to S hosts; S is an integer greater than 0; acquiring a first fragment file returned by a first host of the S hosts; the first fragment file uniquely corresponds to a first classification number, and the first fragment file comprises at least one data record corresponding to the first classification number; and writing the first fragment file into a first merged file uniquely bound to the first classification number; the first merged file is a file storing the data records of the first classification number.

Optionally, each data record includes a data number;

the processing module is specifically configured to:

determining P sets, wherein the sets are numbered 0 to P-1, and each set corresponds to one host of the S hosts; P is an integer greater than 0; and performing a remainder operation on the data number of each of the N data records with respect to P to obtain a remainder result for each data record, and storing each data record into the set, among the P sets, whose number equals the remainder result of that data record.

Optionally, the processing module is further configured to:

acquiring a second fragment file returned by a second host; the second fragment file comprises at least one data record corresponding to the first classification number.

a reading module for reading at least one data record from at least one storage device; each data record in the at least one data record comprises the classification number of the data record;

the processing module is used for processing the at least one data record to obtain K fragment files; each fragment file of the K fragment files uniquely corresponds to one classification number and comprises at least one data record corresponding to the classification number of the fragment file; K is an integer greater than 0; and providing a first fragment file of the K fragment files to a first storage device of the at least one storage device; the first classification number corresponding to the first fragment file is the same as one of the at least one classification number bound to the first storage device; the first fragment file is an intermediate file for merging the data records corresponding to the first classification number; the first storage device is used for storing the merged file of the data records corresponding to the first classification number.

Optionally, the first storage device is a storage device corresponding to the hash value of the first classification number in the at least one storage device.

In the embodiment of the invention, the host writes the read N data records into the first class file corresponding to each classification number and then writes the N data records into the second class file according to the classification number of each data record, and the method processes the data records of a plurality of classification numbers; when the number of data records is increased, the method can be used for processing the data records under a plurality of hosts, each host writes the data record corresponding to each classification number into the first class file corresponding to the classification number and then collects the data records into the second class file, and therefore the data records corresponding to one classification number are processed in parallel by the plurality of hosts.

Drawings

Fig. 1 is a system architecture diagram corresponding to a data processing method in an embodiment of the present application;

Fig. 2 is a flowchart illustrating steps corresponding to a data processing method in an embodiment of the present application;

Fig. 3 is a flowchart illustrating specific steps corresponding to a data processing method in an embodiment of the present application;

Fig. 4 is a schematic diagram illustrating a data record read by a storage device corresponding to a data processing method according to an embodiment of the present invention;

Fig. 5 is a diagram illustrating a host storing data into a fragment file according to a data processing method in an embodiment of the present invention;

Fig. 6 is a diagram illustrating a fragment file generated by a host corresponding to a data processing method according to an embodiment of the present invention;

Fig. 7 is a diagram illustrating a host writing fragment file information records to a storage device according to an embodiment of the present invention;

Fig. 8 is a schematic diagram illustrating that a host corresponding to a data processing method in an embodiment of the present invention maps fragment files of the same organization number to a storage device;

Fig. 9 is a schematic diagram illustrating that a storage device corresponding to the data processing method generates a merged file from fragment files with the same organization number in the embodiment of the present invention;

Fig. 10 is a schematic structural diagram of a device corresponding to a data processing method according to an embodiment of the present invention;

Fig. 11 is a schematic structural diagram of a device corresponding to the data processing method in the embodiment of the present invention.

Detailed Description

In order to better understand the technical solutions, the technical solutions will be described in detail below with reference to the drawings and the specific embodiments of the specification, and it should be understood that the specific features in the embodiments and examples of the present application are detailed descriptions of the technical solutions of the present application, but not limitations of the technical solutions of the present application, and the technical features in the embodiments and examples of the present application may be combined with each other without conflict.

The embodiments of the present invention will be described in further detail with reference to the drawings attached hereto.

In many scenarios, a business system needs to process a large amount of data. The data records are merged into a final file, such as a reconciliation file, and intermediate files, called fragment files, are generated during the merging process. For example, a financial transaction system is a set of software for online financial transactions. With the rapid development of internet finance, business transaction volume has grown greatly, and the demands on the processing capacity of financial transaction systems keep rising. Taking the China UnionPay clearing system as an example, as the company's business has developed and products such as cloud flash payment and code-scanning payment have spread rapidly, transaction volume has grown quickly: the data volume is measured in hundreds of millions of records, and the working memory reaches tens of gigabytes (GB). Processing such a huge number of data records must meet the requirements of high concurrency, high stability, high availability, and distributed processing. The embodiment of the present invention takes the financial transaction system only as an example to describe the data processing method, and the method is not limited to financial transaction systems.

In a financial transaction system, financial transaction data are real-time online data records. Some important files, such as reconciliation files, serve as the sole output of data processing and as the evidentiary basis for fund payment, so the requirements on the correctness and efficiency of data record processing are stricter. A concurrent batch processing system offers high efficiency, high stability, high availability and low cost. Faced with such a large number of data records, a concurrent batch processing system is therefore well suited to a financial transaction system and plays an irreplaceable role.

A concurrent batch processing system must write a large number of data records into intermediate files efficiently, stably and with high availability, and then classify and merge them into final files according to certain constraints, including but not limited to the organization number, the host number and the file name. The data processing method converts data records into fragment files according to their classification numbers and then writes the data records of the fragment files into merged files, thereby achieving a multi-machine, distributed file merging scheme.

Fig. 1 shows the system architecture corresponding to the data processing method in the embodiment of the present invention. The system includes two parts:

M storage devices: each storage device comprises a main database and a standby database; the main database is used by default, and the standby database is used when the main database is down.

F hosts: the hosts write the data records in the M storage devices into fragment files, and write the data records in the fragment files into the merged files in the storage devices.

In the embodiment of the invention, the storage devices, hosts, classification numbers, fragment files and merged files are bound to one another. Specifically, each storage device has a plurality of data sets; each data set corresponds to one host, while one host corresponds to at least one data set. Each host stores a plurality of fragment files, and each fragment file corresponds to one classification number. The merged files correspond one-to-one with the classification numbers; each merged file corresponds to one storage device, while one storage device corresponds to a plurality of merged files.

Fig. 2 is a flowchart illustrating steps corresponding to the data processing method according to an embodiment of the present invention.

Step 201: the first storage device reads the N data records.

In the N data records, each data record comprises a classification number of the data record, and the N data records comprise K different classification numbers; K and N are integers greater than 0.

Step 202: the first storage device provides the N data records to S hosts.

S is an integer greater than 0.

Step 203: the first host reads at least one data record from at least one storage device.

Each data record of the at least one data record includes the classification number of the data record.

Step 204: the first host processes the at least one data record to obtain K fragment files.

Each fragment file of the K fragment files uniquely corresponds to one classification number and comprises at least one data record corresponding to the classification number of the fragment file; K is an integer greater than 0.

Step 205: the first host provides a first shard file of the K shard files to a first storage device of the at least one storage device.

Step 206: the first storage device acquires the first fragment file returned by the first host of the S hosts.

The first fragment file is uniquely corresponding to a first classification number, and the first fragment file comprises at least one data record corresponding to the first classification number.

Step 207: the first storage device writes the first fragmented file to a first merged file that is uniquely bound to the first classification number.

The first merged file is a file storing the data record of the first classification number.
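Steps 201 to 207 can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions: records are handed to hosts round-robin (the text does not fix the distribution rule at this point), fragment and merged files are modelled as Python lists, and the names `host_make_fragments` and `run_flow` are hypothetical.

```python
from collections import defaultdict


def host_make_fragments(host_id, records):
    """Steps 203-204: one fragment per classification number seen by this host."""
    fragments = defaultdict(list)
    for record in records:
        fragments[record["classification"]].append(record)
    return {cls: {"host": host_id, "records": recs} for cls, recs in fragments.items()}


def run_flow(records, num_hosts):
    """Steps 202-207: distribute records, build fragments, merge by classification."""
    # Step 202: provide the N records to S hosts (round-robin is an assumption).
    per_host = [records[i::num_hosts] for i in range(num_hosts)]
    merged = defaultdict(list)  # one merged file per classification number
    for host_id, host_records in enumerate(per_host):
        fragments = host_make_fragments(host_id, host_records)
        # Steps 205-207: each returned fragment is written into the merged
        # file uniquely bound to its classification number.
        for cls, fragment in fragments.items():
            merged[cls].extend(fragment["records"])
    return dict(merged)


records = [{"id": i, "classification": "ORG1" if i % 2 else "ORG2"} for i in range(6)]
merged = run_flow(records, num_hosts=3)
```

In this run every host produces a fragment for both classification numbers, and each merged file ends up holding exactly the records of its own classification number.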

In step 201, each data record includes a classification number and a data number, where the classification number is a field dividing the data record into different types, for example, an organization number to which the data belongs is used as the classification number, and the first storage device is one of the M storage devices in the architecture diagram of fig. 1.

In one possible implementation, the range of finally mapped hash values is divided into H values, where H is an integer greater than 0, and each hash value corresponds to one storage area in one storage device. H is a multiple of M; each storage device has a plurality of storage areas, but each storage area belongs to only one storage device. The identifier of a storage area is called a block identifier. The data records under each classification number are uniformly distributed across the storage areas. For example, if the data records involve W classification numbers, such as organization numbers, the data of each organization is stored evenly across the storage areas; W is an integer greater than 0.
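One way to picture the relationship between the H storage areas and the M storage devices is the sketch below. The text only requires that H be a multiple of M and that each storage area live in exactly one device; the contiguous-range layout and the 0-based block identifiers used here are assumptions made for illustration, as is the function name `device_for_block`.

```python
def device_for_block(block_id, h, m):
    """Return the index of the storage device that owns a given storage area."""
    if h <= 0 or m <= 0 or h % m != 0:
        raise ValueError("H must be a positive multiple of M")
    if not 0 <= block_id < h:
        raise ValueError("block identifier out of range")
    areas_per_device = h // m
    # Assumed layout: each device owns a contiguous run of storage areas.
    return block_id // areas_per_device


layout = [device_for_block(block_id, h=6, m=2) for block_id in range(6)]
```

With H = 6 and M = 2, the first three storage areas belong to device 0 and the last three to device 1, so every device holds the same number of areas.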

In step 202, the first storage device provides the N data records to S hosts.

Wherein the data record of each storage area corresponds to one of the S hosts.

For example, host 1 obtains the data of block identifiers 1-3. For organization 1, host 1 obtains the data of organization 1 that is distributed across storage areas 1-3; the processing of the other application hosts is similar to that of host 1.

In step 203, the first host reads at least one data record from at least one storage device.

In step 204, the first host generates a fragment file from the data records of each classification number.

For example, the first host writes the data records it has read into the fragment file corresponding to organization 1 under the host directory. The processing of the other application hosts is similar to that of host 1. The names of fragment files generated by different hosts can be distinguished by adding an identifier such as the host ID, which prevents collisions when the fragment files are written.
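The following sketch shows one way a host might write one fragment file per organization number while embedding the host ID in the file name to avoid collisions. The naming pattern `<org>.<host>.frag`, the record fields `org_number` and `payload`, and the function names are all hypothetical; the text only says that an identifier such as the host ID is added to the name.

```python
import os
import tempfile
from collections import defaultdict


def write_fragments(host_id, records, host_dir):
    """Write one fragment file per organization number; return {org_number: path}."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["org_number"]].append(record["payload"])
    paths = {}
    for org_number, payloads in grouped.items():
        # The host ID in the name keeps fragments from different hosts distinct.
        path = os.path.join(host_dir, f"{org_number}.{host_id}.frag")
        with open(path, "w", encoding="utf-8") as handle:
            handle.writelines(payload + "\n" for payload in payloads)
        paths[org_number] = path
    return paths


def demo():
    records = [
        {"org_number": "ORG1", "payload": "txn-1"},
        {"org_number": "ORG2", "payload": "txn-2"},
        {"org_number": "ORG1", "payload": "txn-3"},
    ]
    with tempfile.TemporaryDirectory() as host_dir:
        paths = write_fragments("host1", records, host_dir)
        names = sorted(os.path.basename(path) for path in paths.values())
        with open(paths["ORG1"], encoding="utf-8") as handle:
            org1_lines = handle.read().splitlines()
    return names, org1_lines
```

Calling `demo()` writes two fragment files into a temporary host directory and returns their names together with the contents of the ORG1 fragment.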

In step 205, after the first host generates the K fragment files, it provides each fragment file to the corresponding storage device of the at least one storage device.

The first classification number corresponding to the first fragmented file is the same as one of the at least one classification number bound to the first storage device; the first fragmented file is an intermediate file for merging corresponding data records of the first classification number; the first storage device is used for storing the merged file of the data record corresponding to the first classification number.

In step 207, after receiving the first fragment file, the first storage device writes the first fragment file into the first merged file uniquely bound to the corresponding classification number, which completes the merge.

Fig. 3 is a flowchart illustrating specific steps of a data processing method according to an embodiment of the present invention.

Step 301: the host reads N data records from the M storage devices.

In the N data records, each data record comprises the classification number of the type to which the data record belongs; M and N are integers greater than 0.

Step 302: for each of the N data records, the host writes the data record into the fragment file that uniquely corresponds to the classification number of the data record, in the file system corresponding to the host.

The fragment file is an intermediate file in the process of writing the N data records into the merged file.

Step 303: the host writes the data records of the fragment file corresponding to each classification number in the file system into the merged file uniquely corresponding to that classification number.

Wherein the number of hosts writing data records to the merged file is at least one.

Prior to step 301, M storage devices read data records.

Fig. 4 is a schematic diagram of the M storage devices reading data records.

In the embodiment of the invention, to allow the storage devices to scale horizontally, a set of reasonable data segmentation rules is defined to split the full data source so that the data records stored by the M storage devices are balanced. During storage, the M storage devices use a batch processing system, which is suited to processing massive amounts of data. A batch processing system is a processing system that processes multiple pieces of data at a time. After processing by the batch processing system, the data records are stored in the storage devices aggregated along a certain dimension, and are also output in the form of files. The dimensions include, but are not limited to, the primary key of the data record, the payment order number, the practitioner number, the transaction type, and so on. In this application, each dimension is referred to as a classification number.

The scheme distributes data records using a string hash algorithm (ELFHash, an algorithm that computes a hash value from the American Standard Code for Information Interchange (ASCII) code values of a string's characters using bitwise operations). Under this algorithm, the data records can be uniformly dispersed across the corresponding storage areas. For example, a modulo operation is applied to the value computed from the ASCII codes.
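For reference, a commonly published form of the ELF string hash is sketched below. The text names the algorithm but does not list its code, so this variant may differ in detail from what the patented system uses; the helper `block_for` and the choice of taking the hash modulo the number of storage areas are likewise assumptions.

```python
def elf_hash(text):
    """Commonly published ELF string hash over the ASCII codes of `text`."""
    h = 0
    for byte in text.encode("ascii"):
        h = ((h << 4) + byte) & 0xFFFFFFFF  # keep the value within 32 bits
        high = h & 0xF0000000
        if high:
            h ^= high >> 24  # fold the top four bits back into the low bits
        h &= ~high & 0xFFFFFFFF  # then clear the top four bits
    return h


def block_for(key, num_blocks):
    """Assumed mapping: hash value modulo the number of storage areas."""
    return elf_hash(key) % num_blocks
```

Since the top four bits are cleared on every step, the result always fits in 28 bits, and taking it modulo the number of storage areas yields a block identifier in the range 0 to H-1.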

For step 302, Fig. 5 is a schematic diagram of the host storing data into fragment files in the embodiment of the present invention, and Fig. 6 is a schematic diagram of the host generating fragment files in the embodiment of the present invention.

Suppose there are F hosts, and each host has P processes generating fragment files simultaneously. Each process looks up the database connection information in the knowledge base according to the block identifier and the corresponding parameters (table index, system name, etc.), and acquires data from the storage area of the main database corresponding to the block identifier. The knowledge base is a mapping table between block identifiers and host processes. When the main database is down, the data is acquired from the standby database.
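The lookup-then-failover behaviour described above might look like the sketch below. The knowledge base is modelled as a plain dictionary from block identifier to connection information, and the database call is injected as a `query` function; the connection names, the `ConnectionError` signal for an outage and the function names are all assumptions for the example.

```python
def read_block(block_id, knowledge_base, query):
    """Read one storage area, preferring the main database over the standby."""
    connection = knowledge_base[block_id]
    try:
        return query(connection["main"], block_id)
    except ConnectionError:
        # The main database is down: acquire the data from the standby database.
        return query(connection["standby"], block_id)


# Hypothetical knowledge base: block identifier -> connection information.
knowledge_base = {
    1: {"main": "db-main-a", "standby": "db-standby-a"},
    2: {"main": "db-main-b", "standby": "db-standby-b"},
}
DOWN = {"db-main-b"}  # simulate an outage of one main database


def fake_query(database, block_id):
    """Stand-in for a real database read; raises when the database is down."""
    if database in DOWN:
        raise ConnectionError(f"{database} is down")
    return f"records of block {block_id} from {database}"


results = [read_block(block_id, knowledge_base, fake_query) for block_id in (1, 2)]
```

Block 1 is served by its main database, while block 2 falls back to the standby database because its main database is simulated as down.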

For each host, there is a file directory in the network file system. The host stores the acquired data into a fragment file in the host directory.

Fig. 5 is an example in which each host reads data from 3 storage areas, but is not limited to 3 storage areas, and may be any integer number such as 1, 2, 4, and the like. For example, host 1 obtains data for block identifications 1-3. For the data of the organization 1, the host 1 writes the data of the organization 1 distributed in the storage areas 1-3 into the fragment file corresponding to the organization 1 under the host directory. Other application host processes are similar to host 1. The names of fragmented files generated by different hosts can be distinguished by adding identifications such as host ID and the like, so that collision is prevented when the fragmented files are written.

After step 302, each process of each host writes the information records of the fragment files it generated, such as the storage path, fragment file name, block identifier and organization number of each fragment file, into the corresponding storage device. Which database the fragment file information records are written into is likewise determined by looking up the knowledge base.

After step 302, as shown in Fig. 7, every host has generated fragment files for the same organization number, and these fragment files are placed in the host directory corresponding to each host. Using the one-to-many relationship between an organization number and its fragment files, the embodiment of the invention applies the ELFHash algorithm to the organization number to disperse the fragment files of the hosts across the H storage areas, and writes the information records of the fragment files into one of the M storage devices.

Fig. 8 is a schematic diagram of a host mapping fragment files of the same organization number to a storage device in the data processing method of the embodiment of the present invention.

At this stage, using the correspondence among the fragment file name, the merged file name and the organization number, the processing method of the embodiment of the present invention hashes, with the ELFHash algorithm, all the fragment file information records belonging to the same merged file of the same organization number to the same block identifier.

For example, when host 1 receives the scheduling information for block identifiers 1-3 (the other application hosts will not receive these three block identifiers), host 1 takes the fragment file information records with block identifiers 1-3 out of the database, locates the fragment files according to the fragment file name, path and so on in those records, and records the fragment file name, fragment file path, merged file name, merged file path, organization number, block identifier and so on in the merged file information table in the storage device.
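A sketch of how a host could turn fragment file information records into rows of the merged file information table is given below. The field list follows the text; the `FragmentInfo` class, the `<org>.merged` naming rule, the directory paths and the function name `build_merge_rows` are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class FragmentInfo:
    """One fragment file information record (fields as named in the text)."""
    fragment_name: str
    fragment_path: str
    block_id: int
    org_number: str


def build_merge_rows(fragment_infos, assigned_blocks, merged_dir):
    """Build merged-file table rows for the block identifiers scheduled to this host."""
    rows = []
    for info in fragment_infos:
        if info.block_id not in assigned_blocks:
            continue  # another host is scheduled for this block identifier
        merged_name = f"{info.org_number}.merged"  # hypothetical naming rule
        rows.append({
            "fragment_name": info.fragment_name,
            "fragment_path": info.fragment_path,
            "merged_name": merged_name,
            "merged_path": os.path.join(merged_dir, merged_name),
            "org_number": info.org_number,
            "block_id": info.block_id,
        })
    return rows


infos = [
    FragmentInfo("ORG1.host1.frag", "/nfs/host1/ORG1.host1.frag", 1, "ORG1"),
    FragmentInfo("ORG1.host2.frag", "/nfs/host2/ORG1.host2.frag", 1, "ORG1"),
    FragmentInfo("ORG2.host1.frag", "/nfs/host1/ORG2.host1.frag", 5, "ORG2"),
]
rows = build_merge_rows(infos, assigned_blocks={1, 2, 3}, merged_dir="/nfs/merged")
```

Only the two ORG1 records fall within the scheduled block identifiers 1-3, and both rows point at the same merged file, which is what lets fragments from different hosts be merged together later.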

For step 303, Fig. 9 is a schematic diagram of a storage device generating a merged file from the fragment files of the same organization number in the data processing method of the embodiment of the present invention.

When the hosts merge fragment files, R processes on each host merge fragment files simultaneously. Each process on a host receives a block identifier and the knowledge base of the merged file, obtains the connection information of the main database and the standby database holding the merged file information table, and reads from the merged file information table in the main database the rows whose block identifier matches. Using the fragment file names, merged file names, fragment file directories, merged file directories and other information in those rows, the process completes the multi-machine merging of the fragment files.
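The final merge performed by one process can be sketched as appending each listed fragment file to its merged file. The row layout (`fragment_path`, `merged_path`), the file names and the decision to append fragments in sorted path order are assumptions made for the example; the text does not prescribe an ordering.

```python
import os
import tempfile
from collections import defaultdict


def merge_fragments(rows):
    """Append every fragment listed in `rows` to its merged file; return merged paths."""
    by_target = defaultdict(list)
    for row in rows:
        by_target[row["merged_path"]].append(row["fragment_path"])
    for merged_path, fragment_paths in by_target.items():
        with open(merged_path, "a", encoding="utf-8") as merged:
            for fragment_path in sorted(fragment_paths):  # assumed ordering
                with open(fragment_path, encoding="utf-8") as fragment:
                    merged.write(fragment.read())
    return sorted(by_target)


def demo():
    with tempfile.TemporaryDirectory() as root:
        rows = []
        for host in ("host1", "host2"):
            fragment_path = os.path.join(root, f"ORG1.{host}.frag")
            with open(fragment_path, "w", encoding="utf-8") as handle:
                handle.write(f"record from {host}\n")
            rows.append({
                "fragment_path": fragment_path,
                "merged_path": os.path.join(root, "ORG1.merged"),
            })
        (merged_path,) = merge_fragments(rows)
        with open(merged_path, encoding="utf-8") as handle:
            return handle.read()
```

Calling `demo()` creates two fragment files for one organization number, merges them, and returns the contents of the resulting merged file.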

The embodiment of the invention has the following advantages:

1. the fragment files of the same organization number are distributed across the hosts, so the distribution is more balanced and the overall processing performance of the system is greatly improved;

2. if a particular database host goes down, the overall processing result of the system is not affected;

3. the number of hosts and databases can be scaled out horizontally without limit according to actual conditions, that is, processing is spread over multiple machines, and there is no network or disk bottleneck;

4. reads and writes are separated, so file handles need not be managed centrally during use, which improves overall efficiency and avoids file handle conflicts and collisions;

5. multi-host business processing improves the processing efficiency of the system.

Fig. 10 is a schematic structural diagram of a device corresponding to the data processing method in the embodiment of the present invention. The device includes:

a reading module 1001, configured to read N data records; in the N data records, each data record comprises a classification number of the data record, and the N data records comprise K different classification numbers; K and N are integers greater than 0;

a processing module 1002, configured to provide the N data records to S hosts; S is an integer greater than 0; acquire a first fragment file returned by a first host of the S hosts; the first fragment file uniquely corresponds to a first classification number, and the first fragment file comprises at least one data record corresponding to the first classification number; and write the first fragment file into a first merged file uniquely bound to the first classification number; the first merged file is a file storing the data records of the first classification number.

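Optionally, each data record includes a data number;

the processing module 1002 is specifically configured to:

determine P sets, wherein the sets are numbered 0 to P-1, and each set corresponds to one host of the S hosts; P is an integer greater than 0; and perform a remainder operation on the data number of each of the N data records with respect to P to obtain a remainder result for each data record, and store each data record into the set, among the P sets, whose number equals the remainder result of that data record.

Optionally, the processing module 1002 is further configured to:

acquire a second fragment file returned by a second host; the second fragment file comprises at least one data record corresponding to the first classification number.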

Fig. 11 is a schematic structural diagram of a device corresponding to the data processing method in the embodiment of the present invention. The device includes:

a reading module 1101, configured to read at least one data record from at least one storage device; each data record in the at least one data record comprises the classification number of the data record;

a processing module 1102, configured to process the at least one data record to obtain K fragment files; each fragment file of the K fragment files uniquely corresponds to one classification number and comprises at least one data record corresponding to the classification number of the fragment file; K is an integer greater than 0; and provide a first fragment file of the K fragment files to a first storage device of the at least one storage device; the first classification number corresponding to the first fragment file is the same as one of the at least one classification number bound to the first storage device; the first fragment file is an intermediate file for merging the data records corresponding to the first classification number; the first storage device is used for storing the merged file of the data records corresponding to the first classification number.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A data processing method, comprising:

the first storage device acquires a second fragmented file returned by the second host; the second fragmented file comprises at least one data record corresponding to the first classification number;

2. The method of claim 1, wherein each data record includes a data number;

the first storage device providing the N data records to S hosts, comprising:

3. A data processing method, comprising:

the first host providing a first fragmented file of the K fragmented files to a first storage device of the at least one storage device; the first classification number corresponding to the first fragmented file is the same as one of the at least one classification number bound to the first storage device; the first fragmented file is an intermediate file for merging the data records corresponding to the first classification number; the first storage device is used for storing a merged file of the data records corresponding to the first classification number; the first storage device is the storage device, among the at least one storage device, that corresponds to the hash value of the first classification number.

4. A data processing apparatus, comprising:

a processing module for providing the N data records to S hosts, S being an integer greater than 0; acquiring a first fragmented file returned by a first host of the S hosts and a second fragmented file returned by a second host; the second fragmented file comprises at least one data record corresponding to the first classification number; the first fragmented file uniquely corresponds to a first classification number, and the first fragmented file comprises at least one data record corresponding to the first classification number; and writing the first fragmented file into a first merged file uniquely bound to the first classification number; the first merged file is a file storing the data records of the first classification number.

5. The apparatus of claim 4, wherein each data record includes a data number;

the processing module is specifically configured to:

6. A data processing apparatus, comprising:

the processing module is used for processing the at least one data record to obtain K fragmented files; each of the K fragmented files uniquely corresponds to one classification number and comprises at least one data record corresponding to that classification number; K is an integer greater than 0; and for providing a first fragmented file of the K fragmented files to a first storage device of the at least one storage device; the first classification number corresponding to the first fragmented file is the same as one of the at least one classification number bound to the first storage device; the first fragmented file is an intermediate file for merging the data records corresponding to the first classification number; the first storage device is used for storing a merged file of the data records corresponding to the first classification number; the first storage device is the storage device, among the at least one storage device, that corresponds to the hash value of the first classification number.