CN105931091B - File generation method and device - Google Patents


Publication number
CN105931091B
Authority
CN
China
Prior art keywords
information
data set
data sets
record information
transaction
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510670633.8A
Other languages
Chinese (zh)
Other versions
CN105931091A (en
Inventor
樊华
冯哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Unionpay Co Ltd
Original Assignee
China Unionpay Co Ltd

Abstract

The invention relates to the technical field of data processing, and in particular to a file generation method and a file generation device. It addresses the problem that, in the prior art, file generation performs poorly when the volume of transaction information is massive. In the embodiment of the invention, the record information in the first data sets is divided into a plurality of second data sets such that all the record information of any one piece of transaction information falls in the same second data set. Then, for each second data set, the record information with the same identifier is combined into a piece of transaction information according to a set rule, and the transaction record of that second data set is generated. The contents of all the first data sets therefore never need to be combined at once; transaction information only needs to be assembled within each second data set separately. This reduces the data volume handled in each assembly pass, improves the performance of assembling massive record information into transaction information, and improves the efficiency of file generation.

Description

File generation method and device
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a file generation method and apparatus.
Background
For convenience of management, when the amount of transaction information is large, the transaction information is generally split by field across a plurality of cascade tables. For example, if each piece of transaction information has the fields name, time, and amount, three cascade tables can be used: the first table stores the name of each transaction, the second stores the time, and the third stores the amount. One transaction record is thus divided into three pieces of record information stored in three tables. The record information belonging to the same transaction information is assigned the same primary key identifier, so that it can later be assembled back into the transaction information.
Given this storage method, when the record information in a plurality of cascade tables needs to be assembled into complete transaction information, the prior art mainly relies on database sorting. Take the example in which massive transaction information is stored in two cascade tables, table A and table B. Table A stores the basic information of the transaction information and table B stores the additional information. Records in table A and table B are in a one-to-many relationship: one piece of transaction information has one piece of basic information, stored in table A, and a plurality of pieces of additional information, stored in table B, and the basic information and additional information belonging to the same transaction information share the same primary key identifier. The prior art assembles transaction information from the two tables as follows. First, table A and table B are sorted in ascending order of the primary key identifier. Next, one piece of basic information is read from table A, all additional information with the same primary key identifier is read from table B, and the basic information and the pieces of additional information are assembled into one piece of transaction information. The next piece of basic information is then read from table A together with its matching additional information from table B and assembled into a second piece of transaction information, and so on until both tables have been read completely and all basic and additional information has been assembled into the corresponding transaction information.
In this method, all cascade tables are sorted by the database and the record information is then assembled into transaction information from the sorted tables. Because the data volume of transaction information is now huge, sorting it places high demands on the database, so database performance becomes the bottleneck of file generation for massive transaction information. At the same time, transaction information is required to be increasingly detailed, which increases the number of cascade tables and further degrades the efficiency of generating transaction information files.
In summary, in the prior art, file generation performance is poor when the amount of transaction information is large, and a method that improves file generation performance for massive transaction information is urgently needed.
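The prior-art procedure described above amounts to a sort-then-merge over two tables. A minimal sketch, assuming table A rows are `(key, basic)` tuples and table B rows are `(key, extra)` tuples; these shapes and the name `assemble_sorted` are illustrative and do not come from the patent:

```python
def assemble_sorted(table_a, table_b):
    """Sort both tables by primary key, then merge them in a single pass."""
    a_rows = sorted(table_a, key=lambda r: r[0])
    b_rows = sorted(table_b, key=lambda r: r[0])
    transactions = []
    j = 0
    for key, basic in a_rows:
        # Skip additional info whose key has no basic record in table A.
        while j < len(b_rows) and b_rows[j][0] < key:
            j += 1
        extras = []
        # Collect every additional record sharing the current primary key.
        while j < len(b_rows) and b_rows[j][0] == key:
            extras.append(b_rows[j][1])
            j += 1
        transactions.append({"id": key, "basic": basic, "extras": extras})
    return transactions
```

The two `sorted` calls stand in for the database sort, which is the step whose cost grows with the full data volume.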
Disclosure of Invention
The invention provides a file generation method to solve the technical problem that, in the prior art, file generation performance is poor when the volume of transaction information is massive.
In one aspect, a file generation method provided in an embodiment of the present invention includes:
dividing the record information in N first data sets into M second data sets according to the identification of the record information, wherein the record information with the same identification in the N first data sets forms a piece of transaction information, the fields of the transaction information corresponding to the record information of any two first data sets are not completely the same, and the record information of each second data set comprises each field of one piece of transaction information; wherein N is greater than 1 and M is greater than 1;
for each second data set, forming a piece of transaction information by using the record information with the same identifier in the second data set according to a set rule, and generating a transaction record of the second data set;
and obtaining the transaction records corresponding to the N first data sets according to the transaction records of the M second data sets.
Optionally, the dividing, according to the identifier of the record information, the record information in the N first data sets into M second data sets includes:
the second data set comprises N sub-files, and the N sub-files correspond to the N first data sets one by one;
for each recorded information in the N first data sets, performing the following operations:
determining a second data set corresponding to the record information according to the corresponding relation between the preset record information identifier and the second data set identifier;
determining a corresponding subfile in a second data set corresponding to the record information according to a first data set in which the record information is located and the second data set corresponding to the record information;
and writing the record information into the corresponding subfile in the second data set corresponding to the record information.
Optionally, the forming a piece of transaction information from the record information with the same identifier in the second data set according to a set rule includes:
reading N sub-files in the second data set;
and writing the read record information with the same identifier in each subfile into a memory according to a set storage structure, wherein the record information corresponding to each storage structure forms transaction information.
Optionally, the number M of the second data sets is determined according to the following manner:
acquiring the corresponding relation between the number of the second data sets and the total transaction information amount according to the system environment of the current processing system and the data amount of the transaction information;
and determining the number M of second data sets corresponding to the N first data sets according to the total amount of the transaction information in the N first data sets.
Optionally, determining the number of processes as P, where P is a positive integer, according to the system environment of the current processing system and the data size of the piece of transaction information;
determining a second data set to be processed by each process according to the number M of the second data sets and the number P of the processes, including:
if M is less than or equal to P, randomly distributing the M second data sets to M of the P processes;
and if M is larger than or equal to P, sequentially distributing the M second data sets to the P processes.
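The allocation rule above can be sketched as follows. This is a minimal illustration, assuming that "sequentially distributing" means round-robin assignment and that the boundary case M = P is handled by the first branch; the function name `assign_sets` and the `seed` parameter are hypothetical.

```python
import random

def assign_sets(m, p, seed=None):
    """Return a dict mapping process index -> list of second-data-set indices."""
    rng = random.Random(seed)
    assignment = {proc: [] for proc in range(p)}
    if m <= p:
        # Pick M distinct processes at random and give each one data set.
        chosen = rng.sample(range(p), m)
        for set_idx, proc in enumerate(chosen):
            assignment[proc].append(set_idx)
    else:
        # Assumption: "sequentially" is read as round-robin over the P processes.
        for set_idx in range(m):
            assignment[set_idx % p].append(set_idx)
    return assignment
```

With 7 data sets and 3 processes, process 0 receives sets 0, 3, and 6, process 1 receives sets 1 and 4, and process 2 receives sets 2 and 5.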
In the embodiment of the invention, the record information in the first data sets is divided into a plurality of second data sets such that all the record information of any one piece of transaction information falls in the same second data set. Then, for each second data set, the record information with the same identifier is combined into a piece of transaction information according to a set rule, and the transaction record of that second data set is generated. The contents of all the first data sets therefore never need to be combined at once; transaction information only needs to be assembled within each second data set separately. This reduces the data volume handled in each assembly pass, improves the performance of assembling massive record information into transaction information, and improves the efficiency of file generation.
On the other hand, a file generating apparatus provided in an embodiment of the present invention includes:
the dividing unit is used for dividing the record information in the N first data sets into M second data sets according to the identification of the record information, the record information with the same identification in the N first data sets forms a piece of transaction information, the fields of the transaction information corresponding to the record information of any two first data sets are not completely the same, and the record information of each second data set comprises the fields of one piece of transaction information; wherein N is an integer greater than 1, and M is an integer greater than 1;
the first generation unit is used for forming a piece of transaction information by recording information with the same identifier in each second data set according to a set rule and generating a transaction record of the second data set;
and the second generation unit is used for obtaining the transaction records corresponding to the N first data sets according to the transaction records of the M second data sets.
Optionally, the second data set includes N sub-files, and the N sub-files correspond to the N first data sets one to one; the dividing unit is specifically configured to:
for each recorded information in the N first data sets, performing the following operations:
determining a second data set corresponding to the record information according to the corresponding relation between the preset record information identifier and the second data set identifier;
determining a corresponding subfile in a second data set corresponding to the record information according to a first data set in which the record information is located and the second data set corresponding to the record information;
and writing the record information into the corresponding subfile in the second data set corresponding to the record information.
Optionally, the first generating unit is specifically configured to:
reading N sub-files in the second data set;
and writing the read record information with the same identifier in each subfile into a memory according to a set storage structure, wherein the record information corresponding to each storage structure forms transaction information.
Optionally, the dividing unit is further configured to:
acquiring the corresponding relation between the number of the second data sets and the total transaction information amount according to the system environment of the current processing system and the data amount of the transaction information;
and determining the number M of second data sets corresponding to the N first data sets according to the total amount of the transaction information in the N first data sets.
Optionally, the dividing unit is further configured to:
determining the number of processes as P according to the system environment of the current processing system and the data volume of the transaction information, wherein P is a positive integer;
if M is less than or equal to P, randomly distributing the M second data sets to M of the P processes;
and if M is larger than or equal to P, sequentially distributing the M second data sets to the P processes.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of a file generation method according to an embodiment of the present invention;
FIG. 2 is a diagram of a curve fitting function between the transaction information amount and the second data set value under the conditions that the number of CPUs is 4, the memory is 20G, the memory usage rate is 60%, and the single transaction is 4K;
FIG. 3 is a diagram of a curve fitting function between transaction information quantity and concurrent process value under the conditions that the number of CPUs is 4, the memory is 20G, the memory usage rate is 60%, and a single transaction is 4K;
FIG. 4 is a detailed flowchart of a file generation method according to an embodiment of the present invention;
FIG. 5 is a flowchart of a method for assigning record information of a first data set to a second data set according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a file generation apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiments of the present invention will be described in further detail with reference to the drawings attached hereto.
As shown in fig. 1, a method for generating a file according to an embodiment of the present invention includes:
Step 101, dividing the record information in N first data sets into M second data sets according to the identification of the record information, wherein the record information with the same identification in the N first data sets forms a piece of transaction information, the fields of the transaction information corresponding to the record information of any two first data sets are not completely the same, and the record information of each second data set comprises the fields of one piece of transaction information; wherein N is greater than 1 and M is greater than 1;
Step 102, for each second data set, forming a piece of transaction information by using the record information with the same identifier in the second data set according to a set rule, and generating a transaction record of the second data set;
Step 103, obtaining transaction records corresponding to the N first data sets according to the transaction records of the M second data sets.
In the above embodiment, each first data set stores record information covering different fields of the transaction information. For example, if the transaction information has 4 fields, namely card number information, cardholder information, transaction time, and transaction amount, then 4 first data sets may each store one field. Alternatively, the card number information and cardholder information may be stored together as one piece of record information in a first first data set, the transaction time as one piece of record information in a second first data set, and the transaction amount as one piece of record information in a third first data set. Fields may also overlap between first data sets: the card number information and cardholder information may be stored in a first first data set, the cardholder information and transaction time in a second first data set, the transaction time and transaction amount in a third first data set, and so on. The embodiment of the invention does not specifically limit how the fields of the transaction information are divided; it only requires that the record information with the same identifier in the N first data sets together form a piece of transaction information. The record information stored in each second data set, by contrast, includes the contents of every field of each piece of transaction information it holds.
Therefore, in step 101, each piece of record information in each first data set may be read sequentially and divided into the corresponding second data set according to its identifier. Alternatively, a plurality of processes may read the record information in the plurality of first data sets concurrently, so that the record information in all the first data sets is divided into the corresponding second data sets more quickly; for example, with K processes executing concurrently, the division can be sped up by up to K times. Because the record information for the fields of the same transaction carries the same identifier, the record information of all the fields of one transaction is divided into the same second data set, so each second data set contains the record information of every field of its transaction information.
In step 102, after each piece of record information in the N first data sets has been stored in the corresponding second data set, the record information with the same identifier in each second data set is combined into one piece of transaction information according to a set rule, and a transaction record of the second data set is generated. For example, the set rule may be to sort all the record information in the second data set by primary key identifier, read the sorted record information in sequence, and assemble the record information with the same primary key identifier into one piece of transaction information. The set rule may instead leave the record information unsorted: each piece of record information is read in storage order and written into a preset array structure, so that once the second data set has been read, each piece of record information has been written into its corresponding array and each array represents one piece of transaction information.
In step 103, a transaction record file may be generated from all the transaction information in each second data set, and a final transaction record file containing all the transaction information may then be generated from the transaction records of the M second data sets. Alternatively, the transaction records of the M second data sets may be output directly as the transaction record files of the N first data sets, without being collected into a single transaction record file.
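The division in step 101 can be sketched in memory as follows. The patent only requires a preset correspondence between record identifiers and second data sets; using a CRC32 hash modulo M for that correspondence is an assumption made here for illustration, as are the names `bucket_of` and `partition` and the `(record_id, payload)` record shape.

```python
import zlib

def bucket_of(record_id, m):
    """Assumed correspondence: a stable hash of the identifier, modulo M."""
    return zlib.crc32(str(record_id).encode("utf-8")) % m

def partition(first_sets, m):
    """Divide N first data sets into M second data sets by identifier.

    first_sets is a list of N lists of (record_id, payload) tuples.
    Each output record remembers which first data set it came from.
    """
    second_sets = [[] for _ in range(m)]
    for source_idx, records in enumerate(first_sets):
        for record_id, payload in records:
            target = second_sets[bucket_of(record_id, m)]
            target.append((record_id, source_idx, payload))
    return second_sets
```

Because the target depends only on the identifier, every record of one transaction lands in the same second data set regardless of which first data set it was read from or in what order.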
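The first of the two set rules, sorting and then grouping by primary key, can be sketched as below. The sketch assumes each record in a second data set is a `(record_id, source_idx, payload)` tuple, where `source_idx` is the index of the first data set the record came from; that shape and the function name are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def assemble_by_sorting(second_set):
    """Sort one second data set by identifier and group equal identifiers."""
    # Sorting by (identifier, source index) keeps fields in a fixed order.
    rows = sorted(second_set, key=itemgetter(0, 1))
    transactions = []
    for record_id, group in groupby(rows, key=itemgetter(0)):
        transactions.append((record_id, [payload for _, _, payload in group]))
    return transactions
```

The sort here touches only one second data set at a time, which is much smaller than the full first data sets.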
In the embodiment of the invention, the record information in the first data sets is divided into a plurality of second data sets such that all the record information of any one piece of transaction information falls in the same second data set. Then, for each second data set, the record information with the same identifier is combined into a piece of transaction information according to a set rule, and the transaction record of that second data set is generated. The contents of all the first data sets therefore never need to be combined at once; transaction information only needs to be assembled within each second data set separately. This reduces the data volume handled in each assembly pass, improves the performance of assembling massive record information into transaction information, and improves the efficiency of file generation.
Specifically, in step 101, the record information in the N first data sets may be divided into M second data sets according to its identifier as follows. Each second data set includes one file for storing record information. After the second data set corresponding to a piece of record information in the N first data sets is determined, the record information is written into the file of that second data set, so that all the record information corresponding to a given second data set ends up in that data set's file.
Optionally, in consideration of different data attributes of each field of the transaction information, N sub-files may be set in the second data set, where the N sub-files correspond to the N first data sets one to one;
for each recorded information in the N first data sets, performing the following operations:
determining a second data set corresponding to the record information according to the corresponding relation between the preset record information identifier and the second data set identifier;
determining a corresponding subfile in a second data set according to a first data set where the record information is located and the determined second data set corresponding to the record information;
writing the record information to a corresponding subfile in the second data set.
In the method, each second data set comprises N sub-files, the N sub-files correspond to the N first data sets one by one, and the recorded information in the N first data sets has different storage structures, so the recorded information is respectively stored in the corresponding sub-files in the second data sets, and the recorded information can be conveniently managed, maintained and used.
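A file-based sketch of the sub-file layout follows. The directory layout `set_<k>/sub_<i>.csv`, the CSV format, and the CRC32-modulo-M mapping from identifier to second data set are all assumptions made for illustration; the patent does not prescribe any of them.

```python
import csv
import os
import zlib

def split_to_subfiles(first_sets, m, out_dir):
    """Write each record to out_dir/set_<k>/sub_<i>.csv.

    k is the second data set chosen from the record identifier and
    i is the index of the first data set the record came from.
    Returns the sorted list of (k, i) pairs that received records.
    """
    handles = {}
    try:
        for i, records in enumerate(first_sets):
            for record_id, *fields in records:
                k = zlib.crc32(str(record_id).encode("utf-8")) % m
                if (k, i) not in handles:
                    set_dir = os.path.join(out_dir, f"set_{k}")
                    os.makedirs(set_dir, exist_ok=True)
                    path = os.path.join(set_dir, f"sub_{i}.csv")
                    fh = open(path, "w", newline="", encoding="utf-8")
                    handles[(k, i)] = (fh, csv.writer(fh))
                handles[(k, i)][1].writerow([record_id, *fields])
    finally:
        # Up to M * N files are open at once in this simple version.
        for fh, _ in handles.values():
            fh.close()
    return sorted(handles)
```

Keeping one sub-file per source lets each sub-file retain the column layout of the first data set it mirrors.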
Specifically, in step 102, for the case that there are N sub-files in each second data set, which are respectively used for storing the record information from the N first data sets, the record information with the same identifier in the second data set may also be composed into one piece of transaction information according to the set rule by the following manner:
for each second data set, reading N sub-files in the second data set; and writing the read record information with the same identifier in each subfile into a memory according to a set storage structure, wherein the record information corresponding to each storage structure forms transaction information.
For example, arrays of a corresponding number and dimensionality may be allocated in memory according to the number of pieces of transaction information and the number of first data sets: the number of arrays corresponds to the total amount of transaction information in the first data sets, and the dimensionality of each array corresponds to the number of first data sets or to the number of fields of the transaction information. All array values are initialized to null. The N sub-files in the second data set are then read sequentially or concurrently by multiple processes. According to its identifier, each piece of record information is assigned to the array for that identifier and written into the corresponding component of that array, so that in the end each array holds all the record information of one piece of transaction information.
This method is simple and easy to implement. Because all the record information of one piece of transaction information is located in the same second data set, a piece of transaction information can be assembled quickly, and once it has been assembled and output to the corresponding file, its memory can be reclaimed, so memory is not excessively occupied and processing capacity and speed improve. In addition, the method does not need to sort the first data sets or the second data sets; each piece of record information can be read front to back in storage order and processed directly. The method therefore does not depend on database performance, which reduces its limitations and broadens its applicability.
Specifically, for step 103, the final transaction record files may also be generated in other ways according to the needs of a specific application. For example, when transaction information is assembled for each second data set, each piece of transaction information may be written into a corresponding file according to its type as soon as its assembly is completed. For example, the types of the transaction information include a work file, a farming file, a hiring file, a middle file and other files; if the type of the transaction information is the work file, the transaction information is written into the work file, if the type is the farming file, the transaction information is written into the farming file, and so on. In this way, each piece of transaction information is written into its transaction record file as soon as it is assembled, and by the time all transaction information has been assembled, all of it has been written into the corresponding transaction record files.
Specifically, before step 101, the number M of second data sets needs to be determined. There are many ways to determine the value of M. In one way, it is preset that each second data set can process X pieces of transaction information, and M is determined from the total number Y of pieces of transaction information in the N first data sets as Y divided by X. For example, if each second data set is preset to process 100 pieces of transaction information and the total number is 100000, then M is 100000 divided by 100, that is, M is 1000. In another way, each interval of the total amount of transaction information corresponds to a value of M; for example, M is 100 when the total amount is 0 to 10000, M is 200 when the total amount is 10001 to 20000, and so on.
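The array-based assembly can be sketched as below. It is a simplified illustration that assumes each sub-file contributes at most one record per identifier and that `total` is an upper bound on the number of transactions in the second data set; the function name and record shape are hypothetical.

```python
def assemble_into_arrays(subfiles, total):
    """Assemble transactions without sorting.

    subfiles is a list of N iterables of (record_id, payload) tuples,
    one per sub-file; total is the expected number of transactions.
    """
    n = len(subfiles)
    # One array per transaction, one component per sub-file, all null at first.
    arrays = [[None] * n for _ in range(total)]
    ids = [None] * total
    row_of = {}  # identifier -> index of the array assigned to it
    for source_idx, records in enumerate(subfiles):
        for record_id, payload in records:
            # The first sighting of an identifier claims the next free array.
            row = row_of.setdefault(record_id, len(row_of))
            ids[row] = record_id
            arrays[row][source_idx] = payload
    return list(zip(ids, arrays))[:len(row_of)]
```

Records are consumed in storage order, and the order of records inside each sub-file does not affect the result.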
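Writing each transaction to a per-type file as soon as it is assembled can be sketched as follows. The type labels `type_a`, `type_b`, and `other`, the dictionary shape of a transaction, and the in-memory sinks are placeholders chosen for illustration only.

```python
import io

def route_by_type(transactions, sinks, default="other"):
    """Write each assembled transaction to the sink registered for its type."""
    for tx in transactions:
        # Unknown types fall back to the catch-all sink.
        sink = sinks.get(tx["type"], sinks[default])
        sink.write(f"{tx['id']},{tx['amount']}\n")

# In-memory sinks stand in for the per-type transaction record files.
sinks = {"type_a": io.StringIO(), "type_b": io.StringIO(), "other": io.StringIO()}
route_by_type(
    [
        {"id": 1, "type": "type_a", "amount": "10.00"},
        {"id": 2, "type": "type_b", "amount": "5.50"},
        {"id": 3, "type": "unknown", "amount": "7.25"},
    ],
    sinks,
)
```

Since `route_by_type` accepts any iterable, it can consume transactions from a generator and write each one before the next is assembled.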
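Both ways of choosing M reduce to short arithmetic. In the sketch below, rounding up when Y is not a multiple of X is an assumption (the text only gives an evenly divisible example), and the parameters `width` and `step` generalize the 10000-wide intervals and the increments of 100 from the example.

```python
def m_from_capacity(total, per_set):
    """M = Y / X, rounded up so that no transaction is left without a set."""
    return (total + per_set - 1) // per_set

def m_from_intervals(total, width=10000, step=100):
    """0..10000 -> 100, 10001..20000 -> 200, and so on."""
    interval = max(1, (total + width - 1) // width)
    return interval * step
```

With the figures from the text, `m_from_capacity(100000, 100)` gives 1000 and `m_from_intervals(10001)` gives 200.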
Optionally, the number M of the second data sets may also be determined according to the following manner:
acquiring the corresponding relation between the number of the second data sets and the total transaction information amount according to the system environment of the current processing system and the data amount of the transaction information;
and determining the number M of second data sets corresponding to the N first data sets according to the total amount of the transaction information in the N first data sets.
In this method, before the number M of second data sets is determined, a relationship library relating the system environment, the size of a single piece of transaction information, the amount of transaction information, and the number of second data sets is established by curve fitting. The system environment refers to the number of CPUs (Central Processing Units), the memory size, and the memory usage rate. A processing upper limit, for example 200M, may then be set for each second data set, so that the functional relationship between the amount of transaction information and the value of M can be fitted for each combination of system environment and single-transaction size. Fig. 2 is a schematic diagram of the curve fitting function between the transaction information amount and the value of M when the number of CPUs is 4, the memory is 20G, the memory usage rate is 60%, and a single transaction is 4K. Once the relationship library has been established, the system environment of the current processing system and the size of one piece of transaction information are enough to select a unique correspondence between the number of second data sets and the amount of transaction information in the first data sets. If the current system environment differs from that of Fig. 2, for example the number of CPUs is 2, the memory is 5G, the memory usage rate is 80%, and a single transaction is 4K, the curve fitting function between the transaction information amount and the value of M that corresponds to that environment can be found in the pre-established relationship library. After the correspondence has been found according to the system environment of the current processing system, as shown in Fig. 2, the number M of second data sets can be determined from the current amount of transaction information. The relationship library can be built from experience accumulated in practice and is queried in subsequent use according to the actual conditions of the current processing, so as to determine the number M of second data sets.
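The lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the name `RELATION_LIBRARY`, the function `number_of_second_data_sets`, and every sample point are invented here, and piecewise-linear interpolation stands in for the fitted curve, whose actual form the description does not give.

```python
import math
from bisect import bisect_left

# Hypothetical relationship library. Key: (CPUs, memory in G, memory usage in %,
# single transaction size in K). Value: sample points (transaction count, M).
# All numbers are invented for illustration; they are not taken from Fig. 2.
RELATION_LIBRARY = {
    (4, 20, 60, 4): [(100_000, 2), (1_000_000, 20), (10_000_000, 200)],
    (2, 5, 80, 4): [(100_000, 4), (1_000_000, 40), (10_000_000, 400)],
}


def number_of_second_data_sets(env, txn_count):
    """Select the curve for the current environment and read M off it."""
    points = RELATION_LIBRARY[env]
    xs = [x for x, _ in points]
    # Clamp to the first or last sample outside the fitted range.
    if txn_count <= xs[0]:
        return points[0][1]
    if txn_count >= xs[-1]:
        return points[-1][1]
    # Linear interpolation between the two neighbouring samples.
    i = bisect_left(xs, txn_count)
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    m = y0 + (y1 - y0) * (txn_count - x0) / (x1 - x0)
    # M must be an integer greater than 1.
    return max(2, math.ceil(m))
```

Calling `number_of_second_data_sets((2, 5, 80, 4), 1_000_000)` selects the curve for the second environment and evaluates it at the current transaction count.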
In addition, the M second data sets may be processed one after another, or they may be processed concurrently using multiple processes.
Optionally, determining the number of processes as P, where P is a positive integer, according to the system environment and the size of a piece of transaction information;
determining a second data set to be processed by each process according to the number M of the second data sets and the number P of the processes, including:
if M is less than or equal to P, randomly distributing the M second data sets to M of the P processes;
and if M is larger than or equal to P, sequentially distributing the M second data sets to the P processes.
In this method, the number of concurrent processes is determined first. As with the number M of second data sets, a relationship library relating the system environment, the size of a single piece of transaction information, the amount of transaction information, and the number of concurrent processes can be established beforehand by curve fitting. The system environment refers to the number of CPUs (central processing units), the memory size, and the memory usage rate. A processing upper limit for each second data set, for example 200M, and an upper limit on the number of concurrent processes, for example 35, may then be set, and the functional relationship between the amount of transaction information and the number of processes can be fitted for each combination of system environment and single-transaction size. Fig. 3 is a schematic diagram of the curve fitting function between the transaction information amount and the number of processes when the number of CPUs is 4, the memory is 20G, the memory usage rate is 60%, and a single transaction is 4K. Once the relationship library has been established, the system environment of the current processing system and the size of one piece of transaction information are enough to select a unique correspondence between the amount of transaction information and the number of processes, as shown in Fig. 3, and the number P of processes can then be determined from the current amount of transaction information.
If the number M of second data sets is less than or equal to the number P of processes, the M second data sets are randomly distributed to M of the P processes for concurrent processing. If the number M of second data sets is larger than or equal to the number P of processes, the M second data sets are allocated to the P processes in sequence: process 1 processes the 1st, (P+1)-th, (2P+1)-th, ... second data sets; process 2 processes the 2nd, (P+2)-th, (2P+2)-th, ... second data sets; and process P processes the P-th, 2P-th, 3P-th, ... second data sets. For example, with 3 processes and 9 second data sets, process 1 processes second data sets 1, 4 and 7, process 2 processes second data sets 2, 5 and 8, and process 3 processes second data sets 3, 6 and 9. Because the M second data sets are processed concurrently by a plurality of processes, the processing speed is improved, mass data can be assembled into transaction information more quickly, and system efficiency is improved.
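The allocation rule can be illustrated with a short sketch. The function name `assign_data_sets` and its return shape are assumptions made here. When M equals P both conditions in the text apply, and this sketch arbitrarily takes the random branch.

```python
import random


def assign_data_sets(m, p, rng=None):
    """Return {process number: [second data set numbers]}, both 1-based."""
    rng = rng or random.Random()
    if m <= p:
        # Pick M distinct processes at random; each gets exactly one data set.
        chosen = rng.sample(range(1, p + 1), m)
        return {proc: [ds] for ds, proc in enumerate(chosen, start=1)}
    # Sequential allocation: process k gets data sets k, k+P, k+2P, ...
    return {proc: list(range(proc, m + 1, p)) for proc in range(1, p + 1)}


# The worked example from the text: 3 processes, 9 second data sets.
plan = assign_data_sets(9, 3)
```

Here `plan` is `{1: [1, 4, 7], 2: [2, 5, 8], 3: [3, 6, 9]}`, matching the example in the description.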
The file generation method according to the embodiment of the present invention is described in detail below with reference to Fig. 4, which is a detailed flowchart of the method.
Step 401, determining the number M of the second data sets and the number P of concurrent processes according to the system environment of the current processing system, the number of the transaction information and the size of each transaction information;
step 402, according to the identification of the record information, dividing the record information in the N first data sets into corresponding subfiles of the M second data sets;
step 403, distributing the M second data sets to the P processes for concurrent processing, sequentially reading each subfile for each second data set, and reading each piece of record information in the subfiles into a corresponding data structure in the memory according to the identifier of the record information;
step 404, after each piece of transaction information is assembled, writing the assembled record information into a corresponding file according to the attribute information of the assembled record information.
In step 401, the system environment of the current processing system and the size of each piece of transaction information are used to find, in the pre-established relationship library relating the system environment, the size of a single piece of transaction information, the amount of transaction information, and the number of second data sets, the curve fitting function between the transaction information amount and the value of M that corresponds to the current system environment; the number M of second data sets is then determined from the current amount of transaction information. Similarly, the curve fitting function between the transaction information amount and the number of processes that corresponds to the current system environment is found in the pre-established relationship library relating the system environment, the size of a single piece of transaction information, the amount of transaction information, and the number of concurrent processes, and the number P of processes is then determined from the current amount of transaction information.
In step 402, for each piece of record information in the N first data sets, the second data set corresponding to the record information is determined first, and then the subfile within that second data set is determined. For example, assume that the number N of first data sets is 2 and that the first data sets are an A table and an A' table, which store the basic information and the additional information of the transaction information respectively. Fig. 5 is a flowchart of the method for allocating the record information of the first data sets to the second data sets. block_i denotes the i-th second data set, block_i_A denotes the file in the i-th second data set that stores record information from the A table, and block_i_A' denotes the file in the i-th second data set that stores record information from the A' table; block_j, block_j_A and block_j_A' are defined in the same way for the j-th second data set. Referring to Fig. 5, the A table is read first. For each piece of record information in the table, a hash (HASH) function determines the corresponding second data set from the primary key identifier of the record information and the number M of second data sets, and the file to which the record information is written is then determined from that second data set and the A table. For example, a piece of record information with primary key identifier 111 is read from the A table and the number M of second data sets is 100; the hash function HASH(111, 100) determines that the corresponding second data set is the 48th, so the record information is written into the file of the 48th second data set that corresponds to the A table, that is, into the file block_48_A.
In the same way, all the record information in the A table is written into the corresponding file of the second data set to which each piece of record information belongs. After the A table has been read, each record in the A' table is read and written, by the same method, into the corresponding file of its second data set. Because the A table and the A' table use the same hash function, record information with the same primary key identifier is allocated to the same second data set. For example, a piece of record information with primary key identifier 111 is read from the A' table; the hash function HASH(111, 100) determines that the corresponding second data set is the 48th, so the record information is written into the file of the 48th second data set that corresponds to the A' table, that is, into the file block_48_A'. In the method described here the A table and the A' table are read one after the other; they can also be read concurrently by two processes, which speeds up file generation from mass data and improves the processing efficiency of the system.
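A minimal sketch of this partitioning step follows. The description does not say which HASH function is used, so `zlib.crc32` is an assumption made here, and the index it produces for key 111 will in general not be the 48 of the example above. The helper names, the tab-separated record layout, and the table name `A_prime` (standing in for A') are likewise hypothetical.

```python
import os
import tempfile
import zlib


def second_data_set_index(primary_key, m):
    """Map a primary key to a second data set number in 1..m (assumed hash)."""
    return zlib.crc32(str(primary_key).encode("utf-8")) % m + 1


def partition_table(records, table_name, m, out_dir):
    """Write each (primary_key, payload) record to block_<i>_<table_name>."""
    handles = {}
    try:
        for key, payload in records:
            i = second_data_set_index(key, m)
            if i not in handles:
                path = os.path.join(out_dir, f"block_{i}_{table_name}")
                handles[i] = open(path, "a", encoding="utf-8")
            handles[i].write(f"{key}\t{payload}\n")
    finally:
        for handle in handles.values():
            handle.close()


# Both tables go through the same hash, so key 111 lands in the same block.
with tempfile.TemporaryDirectory() as tmp:
    partition_table([(111, "basic-111"), (222, "basic-222")], "A", 100, tmp)
    partition_table([(111, "extra-111")], "A_prime", 100, tmp)
    idx = second_data_set_index(111, 100)
    created = sorted(os.listdir(tmp))
```

Using one hash function for every first data set is what guarantees that all record information of a transaction ends up in the subfiles of a single second data set.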
In step 403, P processes may be used to process the M second data sets concurrently. Each second data set includes N subfiles corresponding to the N first data sets, and the N subfiles respectively store record information from the N first data sets. The record information in each subfile is read in sequence and written into a pre-allocated data structure in memory, where it is then assembled. For example, if the total amount of transaction information in the first data sets is 100000 and every piece of transaction information has the same 100 fields, 100000 arrays may be allocated in advance, each able to store 100 pieces of record information. When a piece of record information in a second data set is read, it is written to the corresponding position of the corresponding array according to its primary key identifier, so that record information with the same primary key identifier is written into the same array. Because all the record information of a piece of transaction information resides in the same second data set, once a second data set has been read, each complete piece of transaction information is contained in one array in memory. Therefore, after a second data set has been completely read into the corresponding array structures, the transaction information can be assembled in memory.
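The in-memory assembly can be sketched as below. The description uses pre-allocated arrays indexed by primary key; this sketch swaps in a Python dictionary keyed by primary key for brevity, and the function name and the field names in the example are assumptions made here.

```python
def assemble_second_data_set(subfiles):
    """Merge records from the N subfiles of one second data set.

    subfiles: list of N iterables of (primary_key, {field: value}).
    Returns {primary_key: merged fields}, one entry per transaction.
    """
    transactions = {}
    for records in subfiles:
        for key, fields in records:
            # Records sharing a primary key are merged into one structure.
            transactions.setdefault(key, {}).update(fields)
    return transactions


# Two subfiles: one from the A table, one from the A' table.
from_a = [(111, {"amount": "25.00"}), (222, {"amount": "7.50"})]
from_a_prime = [(222, {"channel": "POS"}), (111, {"channel": "ATM"})]
assembled = assemble_second_data_set([from_a, from_a_prime])
```

Since every record of a given transaction is in the same second data set, no lookup outside that data set is needed during assembly.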
In step 404, after all the record information of one second data set has been assembled into transaction information, a transaction record file corresponding to that second data set is generated, and the M files corresponding to the M second data sets are then combined into one total transaction record file. Files may also be generated in other ways as needed. For example, each piece of transaction information assembled from a second data set may be written into a file chosen according to the type of the transaction information. If the types include a work file, a farming file, a Chinese file, a posting file, and other files, then transaction information of the work file type is written into the work file, transaction information of the farming file type is written into the farming file, and so on. In this way, each piece of transaction information can be written into its transaction record file as soon as it has been assembled, and once all the transaction information has been assembled it has all been written into the corresponding transaction record files.
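Writing by type can be sketched as follows. The attribute name `type`, the `<type>.txt` file naming, and the `key|field=value` line layout are all assumptions made for illustration; the description only states that the target file is chosen from the attribute information of the assembled transaction.

```python
import os
import tempfile


def write_by_type(transactions, out_dir, type_field="type"):
    """Append each assembled transaction to the file named after its type."""
    written = {}
    for key, fields in transactions.items():
        kind = fields[type_field]
        path = os.path.join(out_dir, f"{kind}.txt")
        line = "|".join(f"{name}={fields[name]}" for name in sorted(fields))
        with open(path, "a", encoding="utf-8") as out:
            out.write(f"{key}|{line}\n")
        # Count how many transactions went to each type, for reporting.
        written[kind] = written.get(kind, 0) + 1
    return written


with tempfile.TemporaryDirectory() as tmp:
    counts = write_by_type(
        {
            111: {"type": "work", "amount": "25.00"},
            222: {"type": "farming", "amount": "7.50"},
            333: {"type": "work", "amount": "3.00"},
        },
        tmp,
    )
    files = sorted(os.listdir(tmp))
```

Opening the file once per transaction keeps the sketch short; a real implementation would more likely hold one open handle per type.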
Based on the same technical concept, the embodiment of the invention also provides file generation equipment which can execute the method embodiment. The file generation device provided by the embodiment of the invention is shown in fig. 6.
A dividing unit 601, configured to divide record information in N first data sets into M second data sets according to identifiers of the record information, where the record information with the same identifier in the N first data sets constitutes a piece of transaction information, fields of the transaction information corresponding to the record information of any two first data sets are not identical, and the record information of each second data set includes fields of a piece of transaction information; wherein N is an integer greater than 1, and M is an integer greater than 1;
a first generating unit 602, configured to, for each second data set, form a piece of transaction information from the record information with the same identifier in the second data set according to a set rule, and generate a transaction record of the second data set;
the second generating unit 603 is configured to obtain transaction records corresponding to the N first data sets according to the transaction records of the M second data sets.
Optionally, the second data set includes N sub-files, and the N sub-files correspond to the N first data sets one to one; the dividing unit 601 is specifically configured to:
for each recorded information in the N first data sets, performing the following operations:
determining a second data set corresponding to the record information according to the corresponding relation between the preset record information identifier and the second data set identifier;
determining a corresponding subfile in a second data set corresponding to the record information according to a first data set in which the record information is located and the second data set corresponding to the record information;
and writing the record information into the corresponding subfile in the second data set corresponding to the record information.
Optionally, the first generating unit 602 is specifically configured to:
reading N sub-files in the second data set;
and writing the read record information with the same identifier in each subfile into a memory according to a set storage structure, wherein the record information corresponding to each storage structure forms transaction information.
Optionally, the dividing unit 601 is further configured to:
acquiring the corresponding relation between the number of the second data sets and the total transaction information amount according to the system environment of the current processing system and the data amount of the transaction information;
and determining the number M of second data sets corresponding to the N first data sets according to the total amount of the transaction information in the N first data sets.
Optionally, the dividing unit 601 is further configured to:
determining the number of processes as P according to the system environment of the current processing system and the data volume of the transaction information, wherein P is a positive integer;
if M is less than or equal to P, randomly distributing the M second data sets to M of the P processes;
and if M is larger than or equal to P, sequentially distributing the M second data sets to the P processes.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (8)

1. A file generation method, comprising:
dividing the record information in N first data sets into M second data sets according to the identification of the record information, wherein the record information with the same identification in the N first data sets forms a piece of transaction information, the fields of the transaction information corresponding to the record information of any two first data sets are not completely the same, and the record information of each second data set comprises each field of one piece of transaction information; the fields of the transaction information corresponding to the record information stored in the first data set and the second data set are different; wherein N is an integer greater than 1, and M is an integer greater than 1;
each second data set comprises N sub-files, and the N sub-files correspond to the N first data sets one by one; for each second data set, reading the N sub-files in the second data set; writing the read record information with the same identifier in each subfile into a memory according to a set storage structure, wherein the record information corresponding to each storage structure forms a piece of transaction information, and generates a transaction record of the second data set;
and obtaining the transaction records corresponding to the N first data sets according to the transaction records of the M second data sets.
2. The method of claim 1, wherein the dividing the record information in the N first data sets into M second data sets according to the record information identification comprises:
for each recorded information in the N first data sets, performing the following operations:
determining a second data set corresponding to the record information according to the corresponding relation between the preset record information identifier and the second data set identifier;
determining a corresponding subfile in a second data set corresponding to the record information according to a first data set in which the record information is located and the second data set corresponding to the record information;
and writing the record information into the corresponding subfile in the second data set corresponding to the record information.
3. A method according to claim 1 or 2, characterized in that the number M of second data sets is determined according to the following way:
acquiring the corresponding relation between the number of the second data sets and the total transaction information amount according to the system environment of the current processing system and the data amount of the transaction information;
and determining the number M of second data sets corresponding to the N first data sets according to the total amount of the transaction information in the N first data sets.
4. The method of claim 3, wherein the number of processes is determined as P, which is a positive integer, according to the system environment of the current processing system and the data size of the piece of transaction information;
determining a second data set to be processed by each process according to the number M of the second data sets and the number P of the processes, including:
if M is less than or equal to P, randomly distributing the M second data sets to M of the P processes;
and if M is larger than or equal to P, sequentially distributing the M second data sets to the P processes.
5. A file generation apparatus, comprising:
the dividing unit is used for dividing the record information in the N first data sets into M second data sets according to the identification of the record information, the record information with the same identification in the N first data sets forms a piece of transaction information, the fields of the transaction information corresponding to the record information of any two first data sets are not completely the same, and the record information of each second data set comprises the fields of one piece of transaction information; the fields of the transaction information corresponding to the record information stored in the first data set and the second data set are different; wherein N is an integer greater than 1, and M is an integer greater than 1; each second data set comprises N sub-files, and the N sub-files correspond to the N first data sets one by one;
a first generating unit, configured to read the N sub-files in each second data set; writing the read record information with the same identifier in each subfile into a memory according to a set storage structure, wherein the record information corresponding to each storage structure forms a piece of transaction information, and generates a transaction record of the second data set;
and the second generation unit is used for obtaining the transaction records corresponding to the N first data sets according to the transaction records of the M second data sets.
6. The apparatus of claim 5, wherein the partitioning unit is specifically configured to:
for each recorded information in the N first data sets, performing the following operations:
determining a second data set corresponding to the record information according to the corresponding relation between the preset record information identifier and the second data set identifier;
determining a corresponding subfile in a second data set corresponding to the record information according to a first data set in which the record information is located and the second data set corresponding to the record information;
and writing the record information into the corresponding subfile in the second data set corresponding to the record information.
7. The apparatus of claim 5 or 6, wherein the dividing unit is further configured to:
acquiring the corresponding relation between the number of the second data sets and the total transaction information amount according to the system environment of the current processing system and the data amount of the transaction information;
and determining the number M of second data sets corresponding to the N first data sets according to the total amount of the transaction information in the N first data sets.
8. The apparatus of claim 7, wherein the partitioning unit is further configured to:
determining the number of processes as P according to the system environment of the current processing system and the data volume of the transaction information, wherein P is a positive integer;
if M is less than or equal to P, randomly distributing the M second data sets to M of the P processes;
and if M is larger than or equal to P, sequentially distributing the M second data sets to the P processes.
CN201510670633.8A 2015-10-13 2015-10-13 File generation method and device Active CN105931091B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510670633.8A CN105931091B (en) 2015-10-13 2015-10-13 File generation method and device

Publications (2)

Publication Number Publication Date
CN105931091A CN105931091A (en) 2016-09-07
CN105931091B true CN105931091B (en) 2020-02-11

Family

ID=56839896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510670633.8A Active CN105931091B (en) 2015-10-13 2015-10-13 File generation method and device

Country Status (1)

Country Link
CN (1) CN105931091B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101425167A (en) * 2007-10-30 2009-05-06 中国银联股份有限公司 Method for generating and parsing trading information
CN102262626A (en) * 2010-05-24 2011-11-30 阿里巴巴集团控股有限公司 Method and device for storing data in database
CN103544593A (en) * 2012-07-09 2014-01-29 中国银联股份有限公司 Method and system for processing records related to terminal transactions
CN104765754A (en) * 2014-01-08 2015-07-08 北大方正集团有限公司 Data storage method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4809989B2 (en) * 2001-03-29 2011-11-09 株式会社日本総合研究所 Data storage method, data storage system, and data storage program
JP2012022386A (en) * 2010-07-12 2012-02-02 Hitachi Information Systems Ltd Data federation system and data importing method

Also Published As

Publication number Publication date
CN105931091A (en) 2016-09-07

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant