WO2023273235A1 - Data comparison method, apparatus, device, and storage medium for files - Google Patents
Data comparison method, apparatus, device, and storage medium for files
- Publication number
- WO2023273235A1 (PCT/CN2021/140732)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- file
- data
- files
- sorted
- difference
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/16—File or folder operations, e.g. details of user interfaces specifically adapted to file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/04—Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- The embodiments of the present application relate to the field of data processing in financial technology (Fintech), and in particular, but not exclusively, to a file data comparison method, a file data comparison apparatus, a file data comparison device, and a computer-readable storage medium.
- The Lingqiantong product on the WeChat user terminal initiates money-market fund purchase and redemption transactions in real time, and processes the shares held by the user in real time.
- The transaction processing records are persisted in the database, and corresponding reconciliation files are generated at the end of each day.
- The reconciliation file uses a special protocol format, with each line recording one transaction.
- The WeChat financial management system needs to check the content of the reconciliation file against the user's real-time transaction records and, treating the reconciliation file as authoritative, process any inconsistent data.
- In the related art, reconciliation is carried out by following the steps in Fig. 1.
- First, the reconciliation file is read directly and the content of each line is parsed; second, key fields obtained through parsing are matched against the transaction data in the database; finally, the several possible matching results are processed.
- When the matching results are processed: if the database contains a transaction record that is absent from the reconciliation file, the transaction needs to be deleted and rolled back; if the reconciliation file contains a transaction record that is absent from the database, a new transaction needs to be added and processed.
- The embodiments of the present application provide a file data comparison method, a file data comparison apparatus, a file data comparison device, and a computer-readable storage medium, so as to solve at least the problem in the related art that, during reconciliation, a large file is processed directly while it is being read and parsed, which makes processing slow and time-consuming.
- An embodiment of the present application provides a file data comparison method, including:
- dividing the N split files into M data partitions, wherein each of the M data partitions corresponds to a user identifier and each data partition contains m sub-files;
- A file data comparison apparatus, comprising:
- a processing module configured to split the obtained reconciliation file in equal proportions to obtain N split files;
- the processing module is further configured to divide the N split files into M data partitions according to the user identifier of the transaction associated with the reconciliation file, wherein each of the M data partitions corresponds to a user identifier and each data partition contains m sub-files;
- the processing module is further configured to perform data cleaning and classification on the m sub-files in each data partition according to transaction type and transaction time information, and to split all the cleaned and classified files in equal proportions to obtain n files to be sorted;
- the processing module is further configured to sort the n files to be sorted in the M data partitions according to the transaction time information to obtain n sorted files;
- a reconciliation module configured to perform data comparison on the n sorted files based on a difference comparison algorithm.
- An embodiment of the present application provides a device, including:
- a memory configured to store executable instructions, and a processor configured to implement the above method when executing the executable instructions stored in the memory.
- An embodiment of the present application provides a computer-readable storage medium that stores executable instructions which, when executed, cause a processor to implement the above method.
- The obtained reconciliation file is split in equal proportions to obtain N split files; according to the user identifier of the transaction associated with the reconciliation file, the N split files are divided into M data partitions, where each of the M data partitions corresponds to a user identifier and each data partition contains m sub-files; according to transaction type and transaction time information, the m sub-files in each data partition are cleaned and classified, and all the cleaned and classified files are split in equal proportions to obtain n files to be sorted; according to the transaction time information, the n files to be sorted in the M data partitions are sorted to obtain n sorted files; and, based on a difference comparison algorithm, data comparison is performed on the n sorted files. In other words, this application first splits the reconciliation file so that a large file is parsed and processed in fragments, which speeds up processing; it then sorts the files within each partition, which improves the accuracy of file processing and avoids the processing failures that are likely when unordered files are processed directly.
- FIG. 1 is a schematic diagram of an account reconciliation process in the related art.
- FIG. 2 is a schematic diagram of an optional architecture of a server provided in an embodiment of the present application.
- Fig. 3 is an optional schematic flow chart of the data comparison method of the files provided by the embodiment of the present application.
- Fig. 4 is a schematic flow chart of file splitting provided by the embodiment of the present application.
- Fig. 5 is a schematic diagram of the result of file splitting provided by the embodiment of the present application.
- Fig. 6 is a schematic flowchart of an overall process of the data comparison method of the files provided by the embodiment of the present application.
- Fig. 7 is a schematic diagram of the result of data cleaning provided by the embodiment of the present application.
- Fig. 8 is a schematic diagram of the result of file numbering provided by the embodiment of the present application.
- FIG. 9 is a schematic flowchart of data sorting in a file block provided by an embodiment of the present application.
- Fig. 10 is a schematic diagram of the result of data sorting in the file block provided by the embodiment of the present application.
- Fig. 11 is a schematic diagram of the results of data sorting between two file blocks provided by the embodiment of the present application.
- Fig. 12 is a schematic diagram of the result of data sorting among three file blocks provided by the embodiment of the present application.
- Fig. 13 is a schematic diagram of data sorting between two files with different numbers provided by the embodiment of the present application.
- Fig. 14 is a schematic diagram of the process of exporting files from the database provided by the embodiment of the present application.
- Fig. 15 is a schematic diagram of the results of exporting files from the database provided by the embodiment of the present application.
- Fig. 16 is a schematic flow chart of exporting files from the database provided by the embodiment of the present application.
- Fig. 17 is a schematic diagram of reconciliation files and database files in different partitions provided by the embodiment of the present application.
- Fig. 18 is a schematic diagram of the comparison between the reconciliation file and the database file provided by the embodiment of the present application.
- Fig. 19 is a schematic diagram of the result of deduplication and retention of difference files between the reconciliation file and the database file provided by the embodiment of the present application.
- Fig. 20 is a schematic flow diagram of deduplication and retention of difference files between the reconciliation file and the database file provided by the embodiment of the present application.
- Fig. 21 is a schematic flow diagram of deduplicating file blocks by calculating the sha1 value provided by the embodiment of the present application.
- Fig. 22 is a schematic diagram of related information of key-value pairs of associated data provided by the embodiment of the present application.
- Fig. 23 is a schematic flow chart of account reconciliation provided by the embodiment of this application.
- An exemplary application of the file data comparison device provided by the embodiments of the present application is described below.
- The file data comparison device provided by the embodiments of the present application can be implemented as a notebook computer, a tablet computer, a desktop computer, a mobile device (for example, a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, or a portable game device), an intelligent robot, or any terminal with a screen display function, and can also be implemented as a server.
- FIG. 2 is a schematic structural diagram of a server 100 provided by an embodiment of the present application.
- The server 100 shown in FIG. 2 includes a processor 110, a network interface 120, a user interface 130, and a memory 150. The components of the server 100 are coupled together through the bus system 140.
- The bus system 140 is used to realize connection and communication between these components.
- The bus system 140 also includes a power bus, a control bus, and a status signal bus.
- For simplicity, the various buses are all labeled as bus system 140 in FIG. 2.
- Processor 110 may be an integrated circuit chip with signal processing capability, such as a general-purpose processor, a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, where the general-purpose processor may be a microprocessor or any conventional processor.
- User interface 130 includes one or more output devices 131 that enable presentation of media content, including one or more speakers and/or one or more visual displays.
- the user interface 130 also includes one or more input devices 132, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
- Memory 150 may be removable, non-removable or a combination thereof. Exemplary hardware devices include solid-state memory, hard disk drives, optical disk drives, and the like. Memory 150 optionally includes one or more storage devices located physically remote from processor 110 . Memory 150 includes volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The non-volatile memory can be read-only memory (Read Only Memory, ROM), and the volatile memory can be random access memory (Random Access Memory, RAM). The memory 150 described in the embodiment of the present application is intended to include any suitable type of memory. In some embodiments, the memory 150 is capable of storing data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
- Operating system 151, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, and a driver layer;
- Network communication module 152, for reaching other computing devices via one or more (wired or wireless) network interfaces 120;
- exemplary network interfaces 120 include Bluetooth, Wi-Fi, and Universal Serial Bus (USB);
- Input processing module 153, configured to detect one or more user inputs or interactions from one or more of the input devices 132 and translate the detected inputs or interactions.
- The apparatus provided by the embodiments of the present application can be implemented in software.
- FIG. 2 shows a file data comparison apparatus 154 stored in the memory 150.
- The file data comparison apparatus 154 may be the file data comparison apparatus in the server 100 and may be software in the form of a program, a plug-in, or the like, comprising the following software modules: a processing module 1541 and a reconciliation module 1542. These modules are logical, so they can be combined arbitrarily or split further according to the functions they implement. The function of each module is explained below.
- The apparatus provided in the embodiments of the present application may also be implemented in hardware.
- As an example, the apparatus may be a processor in the form of a hardware decoding processor that is programmed to execute the file data comparison method provided by the embodiments of the present application; such a processor may use one or more application-specific integrated circuits (ASIC), DSPs, programmable logic devices (PLD), complex programmable logic devices (CPLD), field-programmable gate arrays (FPGA), or other electronic components.
- FIG. 3 is an optional flow chart of the file data comparison method provided in the embodiments of the present application, which is described below in conjunction with the steps shown in FIG. 3.
- Step S201: split the acquired reconciliation file in equal proportions to obtain N split files.
- When the reconciliation file is obtained, the reconciliation file (that is, the large file) is divided into sub-files of equal block size according to the large-file fragment parsing algorithm, yielding N split files.
- Each split file ends with a newline character.
- If the reconciliation file is relatively small, for example smaller than 10MB, there is no need to split it.
- If the reconciliation file is large, for example larger than 10MB, it is split in equal proportions into N sub-files, which then wait for the next processing step.
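- As a minimal sketch of step S201, the function below cuts a file into pieces of roughly equal size and extends each piece to the end of the current line, so that every split file ends with a newline character. The function name `split_file`, the output naming `split_NNNNNN`, and the default 10MB chunk size are illustrative assumptions, not details taken from the application.

```python
import os


def split_file(path: str, out_dir: str, chunk_bytes: int = 10 * 1024 * 1024) -> list[str]:
    """Split `path` into pieces of about `chunk_bytes`, each ending at a line boundary.

    Hypothetical helper: names and sizes are assumptions for illustration only.
    """
    os.makedirs(out_dir, exist_ok=True)
    parts = []
    with open(path, "rb") as src:
        index = 0
        while True:
            chunk = src.read(chunk_bytes)
            if not chunk:
                break
            # Extend the chunk to the end of the current line so no record is cut.
            if not chunk.endswith(b"\n"):
                chunk += src.readline()
            index += 1
            part_path = os.path.join(out_dir, f"split_{index:06d}")
            with open(part_path, "wb") as dst:
                dst.write(chunk)
            parts.append(part_path)
    return parts
```

Because each piece is extended to the next newline, the pieces are only approximately equal in size; concatenating them in order reproduces the original file.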
- The partitioning of files by customer dimension described below is referred to as data partitioning.
- Step S202: according to the user identifier of the transaction associated with the reconciliation file, divide the N split files into M data partitions.
- Each of the M data partitions corresponds to a user identifier, and each data partition includes m sub-files.
- The user identifier that the server assigns to a user, such as an account, is associated with the partition number of a data partition, so that the N split files can be divided into M data partitions by customer dimension.
- For example, when a user registers an account, the server generates a globally unique account identifier (ID), and the account ID includes the number of the partition to which the user belongs.
- For example, a user's 16-digit account ID is 0010000000000001:
- the first three digits, 001, are the partition number;
- the last 13 digits are the auto-increment sequence within the current partition. This account ID is required for each subsequent transaction operation of the user.
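- The account ID layout described above can be illustrated with a short sketch. The function name `parse_account_id` and the validation it performs are assumptions added for illustration; only the 3-digit partition prefix and 13-digit sequence come from the text.

```python
def parse_account_id(account_id: str) -> tuple[str, int]:
    """Return (partition number, sequence within the partition) for a 16-digit account ID.

    Hypothetical helper: the application only states the 3 + 13 digit layout.
    """
    if len(account_id) != 16 or not (account_id.isascii() and account_id.isdigit()):
        raise ValueError(f"expected a 16-digit account ID, got {account_id!r}")
    partition = account_id[:3]        # first three digits: partition number
    sequence = int(account_id[3:])    # last 13 digits: auto-increment sequence
    return partition, sequence


print(parse_account_id("0010000000000001"))  # ('001', 1)
```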
- The system is deployed on servers and partitioned by customer. For example, there are currently 40 partitions; when different customers register accounts with WeBank, each is registered to one of the 40 partitions according to preset rules.
- The number of partitions is not specifically limited in this application, and more partitions can be added on demand when expansion is required. By way of example, Fig. 5 shows one of the N split files being divided into three of the M data partitions, numbered 001, 002, and 003.
- After the application splits the obtained reconciliation file in equal proportions to obtain N split files, it reads and parses each of the N split files line by line and partitions the data according to the system partition to which the user account belongs, generating intermediate sub-files; each partition produces a collection of sub-files.
- The data in the N files needs to be divided into M partitions, and each fragment file under a partition is also stored up to a certain size; 10MB is taken as an example here.
- The files generated by file partitioning in this step are unordered; the data is split only according to the partition to which the user account belongs.
- Data is first written to one file; once that file exceeds the set size, writing continues in a second file, and so on until all the data has been written to files in the designated partition.
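- The partitioning step can be sketched as follows. The record layout `account_id|transaction_type|HHMMSS`, the class name `PartitionWriter`, and the file naming `part_NNNNNN` are all assumptions made for illustration, since the application describes the reconciliation format only as a special protocol format. The sketch starts a new file just before the size limit would be exceeded, which is one reasonable reading of the rollover rule.

```python
import os
from collections import defaultdict


class PartitionWriter:
    """Append lines to per-partition files, starting a new file at a size limit."""

    def __init__(self, out_dir: str, max_bytes: int = 10 * 1024 * 1024):
        self.out_dir = out_dir
        self.max_bytes = max_bytes
        self.file_no = defaultdict(lambda: 1)   # current file number per partition
        self.cur_size = defaultdict(int)        # bytes in the current file per partition

    def write(self, partition: str, line: bytes) -> None:
        # Roll over to the next file if this line would push the current one past the limit.
        if self.cur_size[partition] and self.cur_size[partition] + len(line) > self.max_bytes:
            self.file_no[partition] += 1
            self.cur_size[partition] = 0
        part_dir = os.path.join(self.out_dir, partition)
        os.makedirs(part_dir, exist_ok=True)
        path = os.path.join(part_dir, f"part_{self.file_no[partition]:06d}")
        # Reopening per line keeps the sketch short; a real job would cache handles.
        with open(path, "ab") as dst:
            dst.write(line)
        self.cur_size[partition] += len(line)


def partition_split_files(split_paths, out_dir, max_bytes=10 * 1024 * 1024):
    """Route every line to the partition given by the first 3 digits of its account ID."""
    writer = PartitionWriter(out_dir, max_bytes)
    for path in split_paths:
        with open(path, "rb") as src:
            for line in src:
                # Assumption: the account ID is the first '|'-separated field.
                account_id = line.split(b"|", 1)[0].decode("ascii")
                writer.write(account_id[:3], line)
    return dict(writer.file_no)   # partition number -> number of files written
```

The output within each partition keeps the input order, so it is still unordered with respect to transaction time, as the text notes.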
- Step S203: according to transaction type and transaction time information, perform data cleaning and classification on the m sub-files in each data partition, and split all the cleaned and classified files in equal proportions to obtain n files to be sorted.
- This application takes into account that reconciliation files are generally unordered, that processing them directly has a high probability of failure and requires the reconciliation file to be parsed and processed two or more times, and that transaction types must be processed in a required order. Therefore, this application performs data cleaning and classification on the m sub-files in each of the M data partitions according to transaction type and transaction time information, and splits all the cleaned and classified files in equal proportions to obtain n files to be sorted. Cleaning and classifying the data by these two factors effectively improves sorting efficiency.
- Step S204: according to the transaction time information, sort the n files to be sorted in the M data partitions to obtain n sorted files.
- After the n files to be sorted in the M data partitions are obtained through data cleaning and classification, they are sorted with transaction time information as the sorting dimension to obtain n sorted files.
- Each sorted file is also stored up to a certain size; 2MB is taken as an example here. Sorting the files within a partition in this way improves the accuracy of file processing.
- Step S205: perform data comparison on the n sorted files based on the difference comparison algorithm.
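- This passage does not spell out the difference comparison algorithm. Purely as an illustration of why sorted inputs help, the sketch below walks two already-sorted record lists in a single pass and reports the records found on only one side. The function name `diff_sorted`, the `(transaction time, transaction id)` record shape, and the two-pointer technique itself are assumptions, not the method claimed by the application.

```python
def diff_sorted(recon: list, db: list) -> tuple[list, list]:
    """Compare two lists sorted under the same ordering in one linear pass.

    Returns (records only in the reconciliation data, records only in the database data).
    Hypothetical sketch; the application's own algorithm may differ.
    """
    only_recon, only_db = [], []
    i = j = 0
    while i < len(recon) and j < len(db):
        if recon[i] == db[j]:
            i += 1
            j += 1
        elif recon[i] < db[j]:
            only_recon.append(recon[i])   # missing from the database side
            i += 1
        else:
            only_db.append(db[j])         # missing from the reconciliation side
            j += 1
    only_recon.extend(recon[i:])
    only_db.extend(db[j:])
    return only_recon, only_db


recon = [("090002", "t1"), ("090102", "t2"), ("094002", "t4")]
db = [("090002", "t1"), ("092005", "t3"), ("094002", "t4")]
print(diff_sorted(recon, db))
# ([('090102', 't2')], [('092005', 't3')])
```

Under the handling described for the related art, records found only in the reconciliation data would be added as new transactions, and records found only in the database would be rolled back.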
- Fig. 6 shows the overall flow of the file data comparison method of the present application. First, the reconciliation file (the large file) is split into N sub-files. The N sub-files are then partitioned by user dimension into M data partitions, each containing m sub-files. Next, data cleaning is performed partition by partition on the m sub-files, yielding n sub-files in each partition. Finally, business-logic data processing such as sorting is performed partition by partition on each of the n sub-files, and data comparison is performed on the processed data.
- By splitting first and then sorting, the method provided by this application improves both the efficiency of reading large files and the accuracy of reconciliation.
- In summary, the file data comparison method provided in this application splits the obtained reconciliation file in equal proportions to obtain N split files; divides the N split files into M data partitions according to the user identifier of the transaction associated with the reconciliation file, where each of the M data partitions corresponds to a user identifier and each data partition contains m sub-files; performs data cleaning and classification on the m sub-files in each data partition according to transaction type and transaction time information, and splits all the cleaned and classified files in equal proportions to obtain n files to be sorted; sorts the n files to be sorted in the M data partitions according to the transaction time information to obtain n sorted files; and performs data comparison on the n sorted files based on the difference comparison algorithm. In other words, this application first splits the reconciliation file so that a large file is parsed and processed in fragments, which speeds up processing; it then sorts the files within each partition, which improves the accuracy of file processing and avoids the processing failures that are likely when unordered files are processed directly.
- In step S203, performing data cleaning and classification on the m sub-files in each data partition according to transaction type and transaction time information, and splitting all the cleaned and classified files in equal proportions to obtain n files to be sorted, can be achieved through the following steps:
- A11: read each of the m sub-files, and traverse the transaction type and transaction time information of each line of data in each sub-file.
- A12: process all lines of data in each sub-file according to the cleaning and classification condition that the transaction type is the j-th transaction type and the transaction time falls within a given one-hour window, and obtain all the cleaned and classified files.
- The transaction types include the j-th transaction type. Note that the data in all the cleaned and classified files is still unordered; in the embodiments of this application, the data is subsequently sorted by transaction time.
- The transaction types include at least subscription (purchase) and redemption.
- Fig. 7 shows the files obtained after cleaning and classifying the files in the above partition, including, for example: 09-hour purchase transaction data, that is, transaction data whose transaction type is purchase and whose transaction time is in hour 09; 09-hour redemption transaction data, that is, transaction data whose transaction type is redemption and whose transaction time is in hour 09; and 10-hour redemption transaction data, that is, transaction data whose transaction type is redemption and whose transaction time is in hour 10.
- The split files include w files to be sorted that share the j-th transaction type and the same one-hour transaction time window, and the w files corresponding to all transaction types together form the n files to be sorted.
- After the data is cleaned and classified, the file data in each partition is grouped into different files according to transaction type and hour range.
- The volume of transactions in some hours may be relatively large. Following the file splitting principle described earlier, once a file reaches 10MB, subsequent data is stored in a second file. There may therefore be w files, that is, multiple transaction files, for the same transaction type and the same hour.
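- The grouping by transaction type and hour can be sketched as below. The record layout `account_id|transaction_type|HHMMSS` and the rule of dropping blank or malformed lines as "cleaning" are assumptions for illustration; the application does not define the file format or the cleaning rules in this passage.

```python
from collections import defaultdict


def classify(lines):
    """Group records by (transaction type, hour of transaction time).

    Assumed record layout: account_id|transaction_type|HHMMSS (hypothetical).
    """
    groups = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("|")
        # Assumed cleaning rule: skip blank or malformed lines.
        if len(fields) != 3 or len(fields[2]) != 6:
            continue
        _, tx_type, tx_time = fields
        groups[(tx_type, tx_time[:2])].append(line)
    return dict(groups)


records = [
    "0010000000000001|0|092005\n",
    "0010000000000002|1|090102\n",
    "\n",
    "0010000000000003|0|090002\n",
    "0010000000000004|1|100000\n",
]
for key, group in classify(records).items():
    print(key, len(group))
```

Each group keeps the input order, so the records inside a group remain unordered until the sorting step.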
- In step S204, sorting the n files to be sorted in the M data partitions according to the transaction time information to obtain the n sorted files can be realized through the following steps:
- Taking 10MB per file as an example, the w one-hour files are numbered 1 to w. As shown in Fig. 8, there are w files whose transaction type is subscription and whose transaction time is in hour 09; each of the w files is numbered, giving files numbered 1 to w: 09-hour purchase transaction data file 1, 09-hour purchase transaction data file 2, 09-hour purchase transaction data file 3, ..., 09-hour purchase transaction data file w. At this point the data in all the files is unordered.
- Files are named using the pattern "{transaction type}_{transaction time period}_{file number}". For example, the first file for 09-hour purchases is named 0_09_000001 and the first file for 09-hour redemptions is named 1_09_000001; the file number is an incrementing 6-digit integer.
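- The naming rule can be expressed in a few lines. The function name `bucket_file_name` is hypothetical, and the type codes 0 (purchase) and 1 (redemption) are inferred from the two example names in the text rather than stated as a general rule.

```python
def bucket_file_name(tx_type: int, hour: int, number: int) -> str:
    """Build "{transaction type}_{hour}_{file number}" with a 6-digit file number."""
    return f"{tx_type}_{hour:02d}_{number:06d}"


print(bucket_file_name(0, 9, 1))  # 0_09_000001
print(bucket_file_name(1, 9, 1))  # 1_09_000001
```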
- A22: for the file numbered i among numbers 1 to w, read file blocks of a preset size from file i in parallel, obtaining multiple file blocks of equal size for number i.
- The sorting process is performed in parallel for differently numbered files among the w files.
- The sorting of the file numbered i is described below taking the file numbered 1 as an example; the same sorting method is used for the other numbered files.
- For example, if file No. 1 is 10MB and a 2MB file block is read each time, file No. 1 is read in 5 equal parts.
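- The block arithmetic above can be sketched as a pair of helpers. The names `block_ranges` and `read_block` are hypothetical, and the sketch does not address aligning block edges to line boundaries, which an actual implementation would need so that no record straddles two blocks.

```python
MB = 1024 * 1024


def block_ranges(file_size: int, block_bytes: int) -> list[tuple[int, int]]:
    """Return (offset, length) pairs covering a file in fixed-size blocks."""
    return [
        (offset, min(block_bytes, file_size - offset))
        for offset in range(0, file_size, block_bytes)
    ]


def read_block(path: str, offset: int, length: int) -> bytes:
    """Read one block; independent offsets allow blocks to be read in parallel."""
    with open(path, "rb") as src:
        src.seek(offset)
        return src.read(length)


print(len(block_ranges(10 * MB, 2 * MB)))  # 5
```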
- A23: read file block k among the equally sized file blocks of number i, parse each line of data in file block k in parallel, and obtain the transaction time information of each line of data in file block k.
- For example, for the file numbered 1, the first 2MB block of data is read and each line is parsed line by line to obtain its transaction time, which serves as the basis for sorting.
- A24: when the (i+1)-th line of data in file block k is read (here i denotes a line index, not the file number), compare it with the preceding i lines, determine its target position in file block k, and insert it at the target position, obtaining the sorted file block k.
- In the sorted file block k, the transaction time of the line at the target position is not earlier than the transaction time of the line at the preceding adjacent position and is earlier than the transaction time of the line at the following adjacent position.
- That is, each time a line of data is read, it is compared with the lines already read in file block k to find a position where its time is greater than or equal to the preceding time and less than the following time; the line is inserted at that position and the subsequent lines move back by one. The sorted file block is then written back over the first 2MB block of the file numbered 1, completing the sort for file block k.
- For example, suppose file block 1 contains 6 lines of data. After the first line is read, it is compared only with the next line: since 090002 is less than 092005 in the next line, the first line stays where it is. After the second line is read, 092005 is greater than 090002; compared with the following line, 092005 is also greater than 090102, so the line for 092005 and the line for 090102 should exchange positions.
- After the third line is read, 092005 is greater than 090102 and less than 094002, so the subsequent lines continue to be read and sorted until the time of every line in file block 1 is greater than or equal to the preceding time and less than the following time. The sorting of file block 1 is then complete, and the sorted file block 1 is shown in Fig. 10.
- File blocks 2, 3, 4, and 5 of the file numbered 1 are sorted in the same way as file block 1. Each sorted file block k is written back into the file numbered i, persisting the sorted content to disk, so that every file block is sorted internally.
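- The within-block insertion sort can be sketched as follows. The record layout `account_id|transaction_type|HHMMSS` is an assumption, and the sample values reuse the times mentioned in the example plus two made-up lines. Because `HHMMSS` is fixed width, comparing the strings is equivalent to comparing the times. The standard-library `bisect_right` finds the position after any equal times, matching the "greater than or equal to the preceding time and less than the following time" rule.

```python
import bisect


def sort_block(lines: list[str]) -> list[str]:
    """Insertion-sort the lines of one file block by transaction time.

    Assumed record layout: account_id|transaction_type|HHMMSS (hypothetical).
    """
    sorted_lines: list[str] = []
    times: list[str] = []   # transaction times of sorted_lines, kept in step
    for line in lines:
        tx_time = line.rstrip("\n").split("|")[2]
        # Position where preceding time <= tx_time < following time.
        pos = bisect.bisect_right(times, tx_time)
        times.insert(pos, tx_time)
        sorted_lines.insert(pos, line)   # later lines shift back by one
    return sorted_lines


block = [
    "a|0|090002\n",
    "b|0|092005\n",
    "c|0|090102\n",
    "d|0|094002\n",
    "e|0|090102\n",
    "f|0|090001\n",
]
print("".join(sort_block(block)), end="")
```

Lines with equal times keep their original relative order, so the sort is stable.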
- what A24 obtains is the sorted file blocks for file block k, and further, uses multi-line matching for sorting, so as to realize the sorting among multiple file blocks numbered i.
- file block 1 and file block 2 For the sorting between file block 1 and file block 2, read file block 1 and file block 2, and find m consecutive lines of file block 2 (this m may be 1), which is greater than or equal to the previous time in file block 1 and less than The position of the later time, insert these m lines into this position, and at the same time move the following lines down from the first file block, and the last m lines of file block 1 will move down m lines to file block 2, and move down to The m lines of file block 2 are also reversely compared, and moved to the position corresponding to the sorting of file block 2, so as to achieve the purpose of sorting the two file blocks.
- file block 1 and file block 2 are sorted in memory, and the sorted result is rewritten back to the first and second 2MB file blocks numbered 1.
- file block 3 is compared with file block 1 and file block 2 respectively, finds a suitable position, and moves m lines in file block 3 to a suitable position of file block 1 or file block 2, Correspondingly, after the insertion, the extra m lines are moved down to file block 2 or file block 3. If they are moved to file block 2, the remaining m lines after sorting will continue to be moved down to file block 3, finally achieving the purpose of sorting the three file blocks .
- File block 4 and file block 5 are also processed in the same way.
- File block 4 is compared with file block 1, file block 2, and file block 3 to select an appropriate position for insertion and sorting.
- File block 5 is respectively compared with file block 1, file block 2, and file block 1.
- Block 3 is compared with file block 4, and an appropriate position is selected for insertion sorting. Finally, the sorting among multiple file blocks numbered 1 is completed.
- The files numbered 2 to w are sorted with the same method as the file numbered 1. Files with different numbers can be sorted in parallel, using technologies such as multi-threading or distributed clusters, so that each of the w files is sorted individually.
- The sorting method for the file numbered 1 is then extended to sort across multiple files. As shown in FIG. 13, p lines of data from file block 1 of file number 2 are compared with file block 1 of file number 1; if a suitable position exists, the p lines are inserted there, p lines of data are shifted down through all file blocks of file number 1 into file number 2, and the search for a suitable storage position continues there in turn.
- This cycle repeats until all data lines in all file blocks of file number 1 and file number 2 are sorted.
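The end state of the cross-block and cross-file sorting can be illustrated with a standard k-way merge. Note that this swaps in a different technique: it is not the m-line or p-line block insertion described above, only a compact way to reach the same ordering once every block is already sorted internally. The row layout (HHMMSS time in the first comma-separated field) and the function names are assumptions.

```python
import heapq


def transaction_time(line):
    # Assumption: the first comma-separated field is the HHMMSS transaction time.
    return line.split(",", 1)[0]


def merge_sorted_blocks(blocks):
    """Globally order several individually sorted blocks, keeping block sizes."""
    # k-way merge of the already sorted blocks (a stand-in for block insertion).
    merged = list(heapq.merge(*blocks, key=transaction_time))
    result, start = [], 0
    for block in blocks:
        # Re-cut the merged rows into blocks of the original sizes.
        result.append(merged[start:start + len(block)])
        start += len(block)
    return result


blocks = [["090002,a", "092005,b"], ["090102,c", "094002,d"]]
print(merge_sorted_blocks(blocks))
```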
- Step S205, performing data comparison on the n sorted files based on the difference comparison algorithm, can be implemented through the following steps:
- A31: Export database file i from the database according to the first-line transaction time field and the last-line transaction time field of all sorted file blocks numbered i.
- All sorted file blocks numbered i have the same data partition identifier as database file i.
- Data is exported from the database under the same rules as the data cleaning described above: sub-files are exported according to the delimited data type and transaction time range.
- To export data from the database, first read the time range of the sorted files in the partition; taking a single file as an example, directly read the transaction time fields of the first line and the last line of the file. Second, export the transaction data that falls within that range.
- When exporting files from the database, the transaction type can be obtained from the file name, so that the time range can be delimited when exporting transaction records, and a database script can be used to sort the records directly.
- File names follow a naming rule: the name of a file exported from the database is the corresponding reconciliation file name prefixed with "db_".
- For example, the sorted reconciliation file in the partition is named "0_09_000001".
- The corresponding database export file is named "db_0_09_000001".
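The naming rule and the time-range lookup can be sketched in a few lines. The "db_" prefix and the example file name come from the text; the row layout (HHMMSS time in the first comma-separated field) and the function names `export_file_name` and `time_range` are assumptions made for illustration.

```python
def export_file_name(reconciliation_name):
    # Naming rule from the text: prefix the reconciliation file name with "db_".
    return "db_" + reconciliation_name


def time_range(sorted_lines):
    """Return the (first, last) transaction times of an already sorted file."""
    # Assumption: the first comma-separated field is the transaction time.
    first = sorted_lines[0].split(",", 1)[0]
    last = sorted_lines[-1].split(",", 1)[0]
    return first, last


print(export_file_name("0_09_000001"))
print(time_range(["090002,a", "090102,c", "094002,d"]))
```

Because the file is already sorted, only the first and last lines need to be read to bound the database export.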
- There are two cases for sorting the database export files.
- In the first case, a partition has only one database, and the database export file has already been sorted according to the rules in this step.
- In the second case, the partition has several database export files, for example "db1_0_09_000001", "db2_0_09_000001", and "db3_0_09_000001".
- The data rows of these three files are sorted and merged into one file.
- Database file 1 is exported from the database, and the exported database file 1 is named db_09_000001.
- The transaction type can also be checked; if the transaction type does not match, the export is stopped.
- A32: Calculate the first hash value of all sorted file blocks numbered i based on the difference comparison algorithm.
- The difference comparison algorithm includes, but is not limited to, the message digest algorithm MD5.
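A whole-file hash comparison of this kind might look like the sketch below. MD5 is the algorithm named in the text; the helper names `digest` and `files_differ`, and the choice to hash rows joined by newlines, are assumptions.

```python
import hashlib


def digest(lines, algorithm="md5"):
    """Hash all rows of a file (or of its sorted blocks) into one hex digest."""
    h = hashlib.new(algorithm)
    for line in lines:
        h.update(line.encode("utf-8"))
        h.update(b"\n")  # keep row boundaries part of the hashed content
    return h.hexdigest()


def files_differ(reconciliation_lines, database_lines):
    # Equal digests are treated as identical content, so no row scan is needed.
    return digest(reconciliation_lines) != digest(database_lines)


recon = ["090002,T1,100", "090102,T2,50"]
db = ["090002,T1,100", "090102,T2,55"]
print(files_differ(recon, db))
```

Only when the two digests differ does the more expensive block-by-block matching need to run.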
- A35: Based on the data matching algorithm, remove from database file i the file blocks that are identical to file blocks among all sorted file blocks numbered i, and filter out the first difference file block in database file i and the second difference file block among all sorted file blocks numbered i.
- Identical files are deduplicated; the files that remain are those in which the reconciliation file and the database export file differ.
- Because the files with differences are sorted, most of their blocks of consecutive lines may still be equal.
- The file block data matching method can therefore be used to remove the identical file blocks and keep only the differing parts.
- Files that are completely consistent are deleted, and the remaining files are the ones that differ between the reconciliation file and the database export file.
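One way to picture the file block data matching step is to hash every block on both sides and drop the blocks whose hashes appear on both. This is a simplified sketch under assumptions: blocks are given as lists of rows, MD5 is used as the block hash, and `remove_identical_blocks` is a hypothetical name.

```python
import hashlib


def block_digest(block):
    # Hash one block (a list of rows) as a single unit.
    return hashlib.md5("\n".join(block).encode("utf-8")).hexdigest()


def remove_identical_blocks(recon_blocks, db_blocks):
    """Drop blocks present on both sides; return (db_diff, recon_diff)."""
    recon_hashes = {block_digest(b) for b in recon_blocks}
    db_hashes = {block_digest(b) for b in db_blocks}
    common = recon_hashes & db_hashes
    db_diff = [b for b in db_blocks if block_digest(b) not in common]
    recon_diff = [b for b in recon_blocks if block_digest(b) not in common]
    return db_diff, recon_diff


recon_blocks = [["090002,T1,100", "090102,T2,50"], ["092005,T3,70"]]
db_blocks = [["090002,T1,100", "090102,T2,50"], ["092005,T3,75"]]
print(remove_identical_blocks(recon_blocks, db_blocks))
```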
- A36: Determine the difference information between the first difference file block in database file i and the second difference file block among all sorted file blocks numbered i, and perform data comparison based on the difference information.
- In A35, removing from database file i the file blocks identical to file blocks among all sorted file blocks numbered i, and filtering out the first difference file block in database file i and the second difference file block among all sorted file blocks numbered i, can be realized through the following steps:
- A351: If all sorted file blocks numbered i and database file i contain different numbers of rows of data, remove the first row and the last row of data from all sorted file blocks numbered i and from database file i at least once, obtaining the row-trimmed sorted file blocks numbered i and the row-trimmed file blocks of database file i.
- A354: If the third hash value differs from the fourth hash value, determine that the row-trimmed sorted file blocks numbered i differ from the row-trimmed file blocks of database file i.
- A355: Based on the data matching algorithm, remove from the row-trimmed file blocks of database file i the file blocks identical to file blocks among the row-trimmed file blocks numbered i, and filter out the third difference file block among the row-trimmed file blocks of database file i and the fourth difference file block among all sorted file blocks numbered i.
- A356: Remove the tail row of data from the fourth difference file block numbered i and from the third difference file block of database file i at least once, and filter out the first difference file block in database file i and the second difference file block numbered i; the first difference file block in database file i and the second difference file block numbered i have the same tail-row transaction time.
- The first case is that the 0_09_0000002 file has more rows than db_0_09_000002.
- The second case is that the 0_09_0000002 file has the same number of rows as db_0_09_000002.
- The third case is that the 0_09_0000002 file has fewer rows than db_0_09_000002.
- Step 7) If the two sha1 values are equal, remove the equal file blocks from the two files and keep the lines previously excluded from the two files.
- Two associative arrays can be used to store the sha1 values and data rows of the two difference files, respectively.
- In A36, determining the difference information between the first difference file block in database file i and the second difference file block among all sorted file blocks numbered i, and performing data comparison based on the difference information, can be achieved through the following steps:
- A361: Calculate the fifth hash value of the first difference file block in database file i based on the difference comparison algorithm, and record the fifth hash value as the key and the data row of the first difference file block in database file i as the value, forming the first associative array of database file i.
- A362: Calculate the sixth hash value of the second difference file block numbered i based on the difference comparison algorithm, and record the sixth hash value as the key and the data row of the second difference file block numbered i as the value, forming the second associative array numbered i.
- A363: In each partition, compare the keys of the first associative array of database file i with the keys of the second associative array numbered i, remove the data rows with the same key from the two associative arrays, and obtain the third associative array of database file i and the fourth associative array numbered i.
- The key in the third associative array of database file i is the transaction serial number of each row of data and the value is the data row; the key in the fourth associative array numbered i is the transaction serial number of each row of data and the value is the data row.
- This yields associative arrays whose keys are the transaction serial numbers of the rows and whose values are the data rows.
- The second associative array numbered i is represented by associative array A.
- The first associative array of database file i is represented by associative array B.
- The fourth associative array numbered i is represented by associative array C.
- The third associative array of database file i is represented by associative array D.
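Steps A361 to A363 can be sketched with ordinary dictionaries standing in for the associative arrays A, B, C, and D. The sha1 row hash follows the text; the row layout (time, transaction serial number, amount) and the helper names are assumptions made only for this illustration.

```python
import hashlib


def row_hash(row):
    return hashlib.sha1(row.encode("utf-8")).hexdigest()


def serial_number(row):
    # Assumption: the second comma-separated field is the transaction serial number.
    return row.split(",")[1]


def remaining_by_serial(recon_rows, db_rows):
    """Build A and B keyed by row hash, drop equal keys, remap to C and D."""
    a = {row_hash(r): r for r in recon_rows}  # associative array A (reconciliation)
    b = {row_hash(r): r for r in db_rows}     # associative array B (database)
    same = a.keys() & b.keys()                # rows that are identical on both sides
    c = {serial_number(r): r for h, r in a.items() if h not in same}
    d = {serial_number(r): r for h, r in b.items() if h not in same}
    return c, d


recon = ["090002,T1,100", "090102,T2,50", "092005,T3,70"]
db = ["090002,T1,100", "090102,T2,55", "093000,T4,10"]
print(remaining_by_serial(recon, db))
```

After this step, C and D hold only rows that differ, keyed by serial number, which is the input the three difference cases operate on.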
- A364: Determine the difference information between the third associative array of database file i and the fourth associative array numbered i, and perform data comparison based on the difference information.
- In A364, determining the difference information between the third associative array of database file i and the fourth associative array numbered i, and performing data comparison based on the difference information, can be achieved by the following steps:
- A3641: If the fourth associative array numbered i contains a first key that does not exist in the third associative array of database file i, determine that the difference information indicates the reconciliation file has data that the database lacks; determine the data row corresponding to the first key in the fourth associative array numbered i, and add that data row to the third associative array of database file i.
- A3642: If the third associative array of database file i contains a second key that does not exist in the fourth associative array numbered i, determine that the difference information indicates the database has data that the reconciliation file lacks; determine the data row corresponding to the second key in the third associative array of database file i, and delete that data row from the third associative array of database file i.
- In each partition, traverse the two associative arrays separately: compare all keys of associative array A with all keys of associative array B, and exclude the data with the same key in both. Remap the remaining data in associative arrays A and B into new associative arrays C and D, taking the unique transaction serial number of each row of data as the key and the data row as the value.
- The first type: a key exists in associative array C but not in associative array D, indicating that the reconciliation file has data that the database lacks. Use the key to look up the corresponding data row in associative array C and add that row to associative array D; this adds a new transaction serial number to associative array D together with its transaction data.
- The second type: a key does not exist in associative array C but exists in associative array D, indicating that the database has data that the reconciliation file lacks. Use the key to look up the corresponding data row in associative array D and delete it; this removes redundant and incorrect transaction information from the database.
- The third type: a key exists in both associative array C and associative array D, indicating that the data exists in both the reconciliation file and the database but the transaction data is inconsistent (consistent data has already been removed by the earlier steps of the algorithm). Use the key to look up the corresponding data row in associative array C and replace the data row for that key in associative array D with it, so that the data in the database is consistent with the data in the reconciliation file.
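The three difference types map directly onto set operations over the keys of C and D. The sketch below updates a dictionary that stands in for the database side; the function name `reconcile` and the shape of the returned report are hypothetical.

```python
def reconcile(c, d):
    """Apply the three difference types so that D ends up matching C.

    c: serial number -> row from the reconciliation file (array C)
    d: serial number -> row from the database (array D), updated in place
    """
    added = sorted(c.keys() - d.keys())     # type 1: only in the reconciliation file
    deleted = sorted(d.keys() - c.keys())   # type 2: only in the database
    replaced = sorted(c.keys() & d.keys())  # type 3: in both, but rows differ
    for key in added:
        d[key] = c[key]
    for key in deleted:
        del d[key]
    for key in replaced:
        d[key] = c[key]  # the reconciliation file is treated as authoritative
    return {"added": added, "deleted": deleted, "replaced": replaced}


c = {"T2": "090102,T2,50", "T3": "092005,T3,70"}
d = {"T2": "090102,T2,55", "T4": "093000,T4,10"}
print(reconcile(c, d))
print(d == c)
```

In a real system each branch would presumably issue an insert, delete, or update against the database rather than mutate a dictionary.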
- The file and data difference comparison algorithm of this application adopts the sorting and partition design of the large-file fragment analysis and processing algorithm, which suits parallel computing in distributed systems and makes file block data matching relatively efficient; at the same time, files are processed with high correctness, which greatly reduces the probability of file processing failure.
- The software module can be a file data comparison device in the server 100, including:
- the processing module 1541, configured to split the obtained reconciliation file in equal proportions to obtain N split files;
- the processing module 1541, configured to divide the N split files into M data partitions according to the user ID of the transaction associated with the reconciliation file; wherein each of the M data partitions corresponds to one user ID, and each data partition contains m sub-files;
- the processing module 1541, configured to perform data cleaning and classification on the m sub-files in each data partition according to the transaction type and transaction time information, and to split all cleaned and classified files in equal proportions to obtain n files to be sorted;
- the processing module 1541, configured to sort the n files to be sorted in the M data partitions according to the transaction time information to obtain n sorted files;
- the reconciliation module 1542, configured to perform data comparison on the n sorted files based on the difference comparison algorithm.
- The processing module 1541 is configured to read each of the m sub-files and traverse the transaction type and transaction time information of each line of data in each sub-file; to process all lines of data in each sub-file under the cleaning and classification condition of having the j-th transaction type and a transaction time within a one-hour window, obtaining all cleaned and classified files, wherein the transaction types include the j-th transaction type; and to split all cleaned and classified files in equal proportions to obtain W split files, wherein the W split files include w files to be sorted that have the j-th transaction type and a transaction time within a one-hour window, and the W files corresponding to all transaction types form the n files to be sorted.
- The processing module 1541 is configured to number each of the w files to obtain files numbered 1 to w; for the file numbered i among the files numbered 1 to w, to read, by file memory mapping, file blocks of a preset size from the file numbered i in parallel each time, obtaining multiple file blocks of the same size for number i; to read file block k among those file blocks and parse each line of data in file block k in parallel to obtain the transaction time information of each line of data in file block k; to compare the data, determine the target position of the (i+1)-th row of data in file block k, and insert the (i+1)-th row of data at the target position to obtain the sorted file block k; wherein, in the sorted file block k, the transaction time of the (i+1)-th line of data at the target position is after the transaction time of the i-th line of data at the previous adjacent position of the target position, and is at the i
- The reconciliation module 1542 is configured to export database file i from the database according to the first-line transaction time field and the last-line transaction time field of all sorted file blocks numbered i, wherein all sorted file blocks numbered i have the same data partition identifier as database file i; to calculate the first hash value of all sorted file blocks numbered i based on the difference comparison algorithm; to calculate the second hash value of database file i based on the difference comparison algorithm; if the first hash value differs from the second hash value, to determine that all sorted file blocks numbered i differ from database file i; based on the data matching algorithm, to remove from database file i the file blocks identical to file blocks among all sorted file blocks numbered i, and to filter out the first difference file block in database file i and the second difference file block among all sorted file blocks numbered i; and to determine the difference information between the first difference file block in database file i and the second difference file block among all the
- The reconciliation module 1542 is configured, if all sorted file blocks numbered i and database file i contain different numbers of rows of data, to remove at least once all the sorted file blocks of number i and the database
- the file block and the row-trimmed file block of database file i have the same transaction period;
- the third hash value of all row-trimmed sorted file blocks numbered i is calculated based on the difference comparison algorithm; the fourth hash value of the row-trimmed file blocks of database file i is calculated based on the difference comparison algorithm; if the third hash value differs from the fourth hash value, it is determined that all row-trimmed file blocks numbered i and the row-trimmed file blocks of database file i are different; based on the data matching algorithm, remove the same file blocks from the row-trimmed file
- The reconciliation module 1542 is configured to calculate the fifth hash value of the first difference file block in database file i based on the difference comparison algorithm, and to record the fifth hash value as the key and the data row of the first difference file block in database file i as the value, forming the first associative array of database file i; to calculate the sixth hash value of the second difference file block numbered i based on the difference comparison algorithm, and to record the sixth hash value as the key and the data row of the second difference file block numbered i as the value, forming the second associative array numbered i; in each partition, to compare the keys of the first associative array of database file i with the keys of the second associative array numbered i, remove the data rows with the same key from the two associative arrays, and obtain the third associative array of database file i and the fourth associative array numbered i; wherein, the key in the third as
- The reconciliation module 1542 is configured, if the fourth associative array numbered i contains a first key that does not exist in the third associative array of database file i, to determine that the difference information indicates the reconciliation file has data that the database lacks, to determine the data row corresponding to the first key in the fourth associative array numbered i, and to add that data row to the third associative array of database file i; if the third associative array of database file i contains a second key that does not exist in the fourth associative array numbered i, to determine that the difference information indicates the database has data that the reconciliation file lacks, to determine the data row corresponding to the second key in the third associative array of database file i, and to delete that data row from the third associative array of database file i; if the third key existing in the third associative array of database file i exists in the fourth associative array of number i,
- The file data comparison device obtains N split files by splitting the obtained reconciliation file in equal proportions; divides the N split files into M data partitions according to the user ID of the transaction associated with the reconciliation file, wherein each of the M data partitions corresponds to one user ID and each data partition contains m sub-files; performs data cleaning and classification on the m sub-files in each data partition according to the transaction type and transaction time information, and splits all cleaned and classified files in equal proportions to obtain n files to be sorted; sorts the n files to be sorted in the M data partitions according to the transaction time information to obtain n sorted files; and performs data comparison on the n sorted files based on the difference comparison algorithm. That is to say, this application first splits the reconciliation file so that a large file is parsed and processed in fragments, which speeds up processing; further, the files in each partition are sorted, which improves the accuracy of file processing and avoids the phenomenon of processing failures caused by the high probability of directly processing unordered
- The embodiments of the present application provide a storage medium storing executable instructions; when the executable instructions are executed by a processor, they cause the processor to execute the method provided in the embodiments of the present application.
- The storage medium can be a computer-readable storage medium, for example, a ferroelectric memory (FRAM, Ferroelectric Random Access Memory), a read-only memory (ROM, Read Only Memory), a programmable read-only memory (PROM, Programmable Read Only Memory), an erasable programmable read-only memory (EPROM, Erasable Programmable Read Only Memory), an electrically erasable programmable read-only memory (EEPROM, Electrically Erasable Programmable Read Only Memory), a flash memory, a magnetic surface memory, an optical disc, or a CD-ROM (Compact Disk-Read Only Memory); it can also be any device that includes one of the above memories or any combination of them.
- Executable instructions may take the form of programs, software, software modules, scripts, or code written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and they can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
- Executable instructions may, but do not necessarily, correspond to files in a file system. They may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a Hyper Text Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple cooperating files (for example, files that store one or more modules, subroutines, or code sections).
- Executable instructions may be deployed to be executed on one computing device, on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.
- The embodiments of the present application provide a file data comparison method, device, equipment, and storage medium. The acquired reconciliation file is split in equal proportions to obtain N split files; the N split files are divided into M data partitions according to the user ID of the transaction associated with the reconciliation file, wherein each of the M data partitions corresponds to one user ID and each data partition contains m sub-files; the m sub-files in each data partition are cleaned and classified according to the transaction type and transaction time information, and all cleaned and classified files are split in equal proportions to obtain n files to be sorted; the n files to be sorted in the M data partitions are sorted according to the transaction time information to obtain n sorted files; and data comparison is performed on the n sorted files based on the difference comparison algorithm. That is to say, the application first splits the reconciliation file so that a large file is parsed and processed in fragments, which speeds up processing. Further, sorting the files in each partition improves the accuracy of file processing and avoids the phenomenon of processing failure caused by
Landscapes
- Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Accounting & Taxation (AREA)
- Finance (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Human Computer Interaction (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Development Economics (AREA)
- Economics (AREA)
- Marketing (AREA)
- Strategic Management (AREA)
- Technology Law (AREA)
- General Business, Economics & Management (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
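A file data comparison method, apparatus, device, and storage medium, relating to the technical field of data processing in financial technology. The method includes: splitting an acquired reconciliation file in equal proportions to obtain N split files (S201); dividing the N split files into M data partitions according to the user ID of the transaction associated with the reconciliation file (S202), wherein each of the M data partitions corresponds to one user ID and each data partition contains m sub-files; performing data cleaning and classification on the m sub-files in each data partition according to transaction type and transaction time information, and splitting all cleaned and classified files in equal proportions to obtain n files to be sorted (S203); sorting the n files to be sorted in the M data partitions according to the transaction time information to obtain n sorted files (S204); and performing data comparison on the n sorted files based on a difference comparison algorithm (S205).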
Description
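Cross-Reference to Related Applications
This application is based on, and claims priority to, Chinese patent application No. 202110724780.4, filed on June 29, 2021, the entire contents of which are incorporated herein by reference.
The embodiments of this application relate to the technical field of data processing in financial technology (Fintech), and relate to, but are not limited to, a file data comparison method, a file data comparison apparatus, a file data comparison device, and a computer-readable storage medium.
With the development of computer technology, more and more technologies are being applied in the financial field, and the traditional financial industry is gradually shifting to financial technology (Fintech). However, because of the security and real-time requirements of the financial industry, Fintech also places higher demands on technology.
In the Fintech field, WeBank's transaction products have very large numbers of users and transactions. With hundreds of millions of existing users and daily transactions, verifying that each user's daily transactions were processed correctly is a hard problem. For example, the Lingqiantong (零钱通) product in the WeChat client initiates money market fund subscription and redemption transactions in real time and updates the shares held by the user in real time. The transaction processing records are persisted to the database, and corresponding reconciliation files are generated at the end of each day: one end-of-day reconciliation file for subscription transactions and one for redemptions. The reconciliation file uses a special protocol format in which each line records one transaction. Hundreds of millions of transactions are sent to the WeChat wealth management system through the reconciliation file; the system must check the contents of the reconciliation file against the data of the users' real-time transaction records and must resolve inconsistent data by treating the reconciliation file as authoritative.
In the related art, reconciliation follows the steps in FIG. 1. First, the reconciliation file is read directly and each line is parsed. Second, key fields obtained by parsing are matched against the transaction data in the database. Finally, the matching outcomes are handled. If a transaction record is absent from the reconciliation file but present in the database, the transaction must be deleted and rolled back. If a transaction record is present in the reconciliation file but absent from the database, the transaction must be added and processed. If a transaction record is present in both, there are two cases: either the transaction data is inconsistent, and the transaction must be processed according to the reconciliation file, or the transaction data is consistent, meaning the accounts agree and nothing needs to be done. The related art therefore has at least the problem that a large file is parsed and processed at the same time while being read during reconciliation, which makes processing inefficient and time-consuming.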
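Summary
The embodiments of this application provide a file data comparison method, a file data comparison apparatus, a file data comparison device, and a computer-readable storage medium, to solve at least the problem in the related art that a large file is parsed and processed at the same time while being read during reconciliation, which makes processing inefficient and time-consuming.
The technical solutions of the embodiments of this application are implemented as follows:
An embodiment of this application provides a file data comparison method, including:
splitting an acquired reconciliation file in equal proportions to obtain N split files;
dividing the N split files into M data partitions according to the user ID of the transaction associated with the reconciliation file; wherein each of the M data partitions corresponds to one user ID, and each data partition contains m sub-files;
performing data cleaning and classification on the m sub-files in each data partition according to transaction type and transaction time information, and splitting all cleaned and classified files in equal proportions to obtain n files to be sorted;
sorting the n files to be sorted in the M data partitions according to the transaction time information to obtain n sorted files;
performing data comparison on the n sorted files based on a difference comparison algorithm.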
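A file data comparison apparatus, including:
a processing module, configured to split an acquired reconciliation file in equal proportions to obtain N split files;
the processing module, configured to divide the N split files into M data partitions according to the user ID of the transaction associated with the reconciliation file; wherein each of the M data partitions corresponds to one user ID, and each data partition contains m sub-files;
the processing module, configured to perform data cleaning and classification on the m sub-files in each data partition according to transaction type and transaction time information, and to split all cleaned and classified files in equal proportions to obtain n files to be sorted;
the processing module, configured to sort the n files to be sorted in the M data partitions according to the transaction time information to obtain n sorted files;
a reconciliation module, configured to perform data comparison on the n sorted files based on a difference comparison algorithm.
An embodiment of this application provides a device, including:
a memory, configured to store executable instructions; and a processor, configured to implement the above method when executing the executable instructions stored in the memory.
An embodiment of this application provides a computer-readable storage medium storing executable instructions which, when executed by a processor, implement the above method.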
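The embodiments of this application have the following beneficial effects:
The acquired reconciliation file is split in equal proportions to obtain N split files; the N split files are divided into M data partitions according to the user ID of the transaction associated with the reconciliation file, wherein each of the M data partitions corresponds to one user ID and each data partition contains m sub-files; the m sub-files in each data partition are cleaned and classified according to transaction type and transaction time information, and all cleaned and classified files are split in equal proportions to obtain n files to be sorted; the n files to be sorted in the M data partitions are sorted according to the transaction time information to obtain n sorted files; and data comparison is performed on the n sorted files based on a difference comparison algorithm. That is to say, this application first splits the reconciliation file so that a large file is parsed and processed in fragments, which speeds up processing; further, the files in each partition are sorted, which improves the accuracy of file processing and avoids the failures that are likely when unordered files are processed directly.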
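FIG. 1 is a schematic diagram of a reconciliation flow in the related art;
FIG. 2 is an optional schematic architecture diagram of a server provided by an embodiment of this application;
FIG. 3 is an optional schematic flowchart of the file data comparison method provided by an embodiment of this application;
FIG. 4 is a schematic flowchart of file splitting provided by an embodiment of this application;
FIG. 5 is a schematic diagram of the result of file splitting provided by an embodiment of this application;
FIG. 6 is an overall schematic flowchart of the file data comparison method provided by an embodiment of this application;
FIG. 7 is a schematic diagram of the result of data cleaning provided by an embodiment of this application;
FIG. 8 is a schematic diagram of the result of file numbering provided by an embodiment of this application;
FIG. 9 is a schematic flowchart of sorting data within a file block provided by an embodiment of this application;
FIG. 10 is a schematic diagram of the result of sorting data within a file block provided by an embodiment of this application;
FIG. 11 is a schematic diagram of the result of sorting data between two file blocks provided by an embodiment of this application;
FIG. 12 is a schematic diagram of the result of sorting data among three file blocks provided by an embodiment of this application;
FIG. 13 is a schematic diagram of sorting data between two files with different numbers provided by an embodiment of this application;
FIG. 14 is a schematic diagram of the process of exporting a file from the database provided by an embodiment of this application;
FIG. 15 is a schematic diagram of the result of exporting a file from the database provided by an embodiment of this application;
FIG. 16 is a schematic flowchart of exporting a file from the database provided by an embodiment of this application;
FIG. 17 is a schematic diagram of the reconciliation files and database files in different partitions provided by an embodiment of this application;
FIG. 18 is a schematic diagram comparing a reconciliation file with a database file provided by an embodiment of this application;
FIG. 19 is a schematic diagram of the result of deduplicating a reconciliation file against a database file and retaining the difference files, provided by an embodiment of this application;
FIG. 20 is a schematic flowchart of deduplicating a reconciliation file against a database file and retaining the difference files, provided by an embodiment of this application;
FIG. 21 is a schematic flowchart of deduplicating file blocks by calculating sha1 values provided by an embodiment of this application;
FIG. 22 is a schematic diagram of information about the key-value pairs of the associated data provided by an embodiment of this application;
FIG. 23 is a schematic flowchart of reconciliation provided by an embodiment of this application.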
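To make the objectives, technical solutions, and advantages of this application clearer, this application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be regarded as limiting this application; all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the scope of protection of this application.
The following description refers to "some embodiments", which describe a subset of all possible embodiments. It can be understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with one another where no conflict arises. Unless otherwise defined, all technical and scientific terms used in the embodiments of this application have the meanings commonly understood by a person skilled in the technical field of these embodiments. The terms used in the embodiments of this application serve only to describe the embodiments and are not intended to limit this application.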
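The following describes an exemplary application of the file data comparison device provided by the embodiments of this application. The device may be implemented as any terminal with a screen display function, such as a notebook computer, a tablet computer, a desktop computer, a mobile device (for example, a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, or a portable game device), or an intelligent robot, and may also be implemented as a server. An exemplary application in which the file data comparison device is implemented as a server is described below.
Referring to FIG. 2, FIG. 2 is a schematic structural diagram of the server 100 provided by an embodiment of this application. The server 100 shown in FIG. 2 includes at least one processor 110, at least one network interface 120, a user interface 130, and a memory 150. The components of the server 100 are coupled together through a bus system 140. It can be understood that the bus system 140 implements connection and communication among these components. In addition to a data bus, the bus system 140 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, all of the buses are labeled as the bus system 140 in FIG. 2.
The processor 110 may be an integrated circuit chip with signal processing capability, such as a general-purpose processor, a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, where the general-purpose processor may be a microprocessor or any conventional processor.
The user interface 130 includes one or more output devices 131 that enable the presentation of media content, including one or more speakers and/or one or more visual display screens. The user interface 130 also includes one or more input devices 132, including user interface components that facilitate user input, such as a keyboard, a mouse, a microphone, a touch screen display, a camera, and other input buttons and controls.
The memory 150 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid-state memory, hard disk drives, optical disc drives, and the like. The memory 150 optionally includes one or more storage devices physically remote from the processor 110. The memory 150 includes volatile memory or non-volatile memory, and may include both. The non-volatile memory may be a read-only memory (ROM), and the volatile memory may be a random access memory (RAM). The memory 150 described in the embodiments of this application is intended to include any suitable type of memory. In some embodiments, the memory 150 can store data to support various operations; examples of such data include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.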
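An operating system 151, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, and a driver layer, for implementing various basic services and processing hardware-based tasks;
A network communication module 152, for reaching other computing devices via one or more (wired or wireless) network interfaces 120; exemplary network interfaces 120 include Bluetooth, Wireless Fidelity (Wi-Fi), and Universal Serial Bus (USB);
An input processing module 153, for detecting one or more user inputs or interactions from one of the one or more input devices 132 and translating the detected inputs or interactions.
In some embodiments, the apparatus provided by the embodiments of this application may be implemented in software. FIG. 2 shows a file data comparison apparatus 154 stored in the memory 150. The file data comparison apparatus 154 may be the file data comparison apparatus in the server 100 and may be software in the form of a program, a plug-in, or the like, including the following software modules: a processing module 1541 and a reconciliation module 1542. These modules are logical, so they may be combined arbitrarily or split further according to the functions implemented. The function of each module is described below.
In other embodiments, the apparatus provided by the embodiments of this application may be implemented in hardware. As an example, the apparatus may be a processor in the form of a hardware decoding processor that is programmed to execute the file data comparison method provided by the embodiments of this application. For example, the processor in the form of a hardware decoding processor may use one or more application-specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs), complex programmable logic devices (CPLDs), field-programmable gate arrays (FPGAs), or other electronic components.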
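The file data comparison method provided by the embodiments of this application is described below in connection with the exemplary application and implementation of the server 100. Referring to FIG. 3, which is an optional schematic flowchart of the file data comparison method provided by an embodiment of this application, the method is described with reference to the steps shown in FIG. 3.
Step S201: split the obtained reconciliation file into equal parts to obtain N split files.
In the embodiments of this application, once the reconciliation file has been obtained, the reconciliation file (a large file) is cut block by block into equal-sized sub-files according to a large-file sharded parsing algorithm, yielding N split files; each split file ends at a line break. Splitting the reconciliation file into sub-files makes it possible to exploit the parallelism of a distributed system and process all sub-files at the same time, which speeds up processing.
In other embodiments of this application, as shown in FIG. 4, if the reconciliation file is small, for example smaller than 10 MB, no splitting is needed; when reconciliation is performed, the difference comparison algorithm provided by this application is applied directly. In general, however, reconciliation files are large, for example larger than 10 MB. Such a file is split into N equal sub-files that then await the next processing step, for example the partitioning of files (that is, data partitioning) along the customer dimension described below.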
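The newline-aligned splitting of step S201 can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's implementation: it works on an in-memory byte string rather than a file on disk, and the function name `split_on_newlines` and the parameter `chunk_size` are hypothetical names introduced here.

```python
def split_on_newlines(data: bytes, chunk_size: int) -> list[bytes]:
    """Cut data into chunks of roughly chunk_size bytes, each ending at a line break.

    A chunk is extended to the next newline so that no row is cut in half,
    which means a chunk may be slightly larger than chunk_size.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunks = []
    start = 0
    total = len(data)
    while start < total:
        end = min(start + chunk_size, total)
        if end < total:
            # Look for a newline at or after the last byte of the tentative chunk.
            newline = data.find(b"\n", end - 1)
            end = total if newline == -1 else newline + 1
        chunks.append(data[start:end])
        start = end
    return chunks
```

Because every chunk ends on a row boundary, each chunk can be handed to a separate worker and parsed independently.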
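Step S202: distribute the N split files into M data partitions according to the user identifiers of the transactions associated with the reconciliation file.
Each of the M data partitions corresponds to one user identifier and contains m sub-files. Here, the user identifier that the server assigns to a user, for example an account number, is associated with the partition number of a data partition, so that the N split files can be distributed into the M data partitions along the customer dimension.
In the embodiments of this application, when a user registers an account, the server generates a globally unique account identifier (identity document, ID) that contains the number of the partition to which the user belongs. For example, if a user's 16-digit account ID is 0010000000000001, the first three digits, 001, are the partition number, and the last 13 digits are an auto-increment sequence within that partition. Every subsequent transaction by the user uses this account.
The system is deployed on servers and partitioned by customer. Suppose, for example, that there are currently 40 partitions: when a customer registers an account with WeBank, the account is registered in one of the 40 partitions according to a preset rule. This application does not specifically limit the number of partitions, and more partitions can be added as required when capacity needs to be expanded later. As an example, FIG. 5 shows one of the N split files being distributed into three of the M data partitions, whose partition numbers are 001, 002, and 003.
In other words, after splitting the obtained reconciliation file into N split files, this application reads and parses each split file line by line, partitions the data by the system partition to which each user account belongs, and generates intermediate shard sub-files, so that each partition ends up with a set of sub-file shards. The data in the N files has to be divided among the M partitions, and each shard file in a partition is stored up to a certain size, taken here to be 10 MB. Note that the files generated by this partitioning step are unordered: the step merely splits the data by the partition to which each user account belongs. Within a partition, data is first written to one file; once that file exceeds the configured size, a second file is started, and so on until all data has been written to files in the designated partition.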
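The partitioning of step S202 can be sketched as below. The sketch assumes the "|"-separated row layout used by the application's reconciliation files, with the account ID as the second field, and it keeps shards in memory instead of writing 10 MB files. It also starts a new shard just before the size limit would be exceeded, a small simplification of the "start a second file once the first exceeds the limit" rule. The names `partition_rows`, `PARTITION_DIGITS`, and `ACCOUNT_FIELD` are hypothetical.

```python
from collections import defaultdict

PARTITION_DIGITS = 3  # leading digits of the 16-digit account ID hold the partition number
ACCOUNT_FIELD = 1     # position of the account ID in a "|"-separated row

def partition_rows(rows, max_bytes):
    """Group rows by partition number, rolling over to a new shard when one is full."""
    shards = defaultdict(lambda: [[]])  # partition number -> list of shards (lists of rows)
    sizes = defaultdict(int)            # partition number -> bytes in the current shard
    for row in rows:
        partition = row.split("|")[ACCOUNT_FIELD][:PARTITION_DIGITS]
        row_bytes = len(row.encode("utf-8")) + 1  # +1 for the trailing newline
        if shards[partition][-1] and sizes[partition] + row_bytes > max_bytes:
            shards[partition].append([])
            sizes[partition] = 0
        shards[partition][-1].append(row)
        sizes[partition] += row_bytes
    return dict(shards)
```

Rows keep their arrival order inside each shard, which matches the statement that the output of this step is still unordered by transaction time.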
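Step S203: clean and classify the m sub-files in each data partition by transaction type and transaction time information, and split all the cleaned and classified files into equal parts to obtain n files to be sorted.
This application recognizes that reconciliation files are generally unordered, so processing them directly has a high probability of failure and requires the file to be parsed a second time or several times. The order in which transaction types are processed, however, is subject to requirements. This application therefore cleans and classifies the m sub-files in each of the M data partitions by transaction type and transaction time information, and splits all the cleaned and classified files into equal parts to obtain n files to be sorted. Cleaning and classifying the data by these two factors, transaction type and transaction time information, effectively improves sorting efficiency.
Step S204: sort the n files to be sorted in the M data partitions by transaction time information to obtain n sorted files.
In the embodiments of this application, once data cleaning and classification has produced the n files to be sorted in the M data partitions, those files are sorted with transaction time information as the sort key, yielding n sorted files. Each sorted file is likewise stored up to a certain size, taken here to be 2 MB. Sorting the files within each partition in this way improves the accuracy of file processing.
Step S205: compare the data of the n sorted files on the basis of a difference comparison algorithm.
In one possible embodiment, FIG. 6 shows the overall flow of the file data comparison method of this application. First, the reconciliation file (a large file) is split into N equal sub-files. Next, the N sub-files are partitioned along the user dimension into M data partitions, each containing m sub-files. Then, partition by partition, the m sub-files of each partition undergo data cleaning, so that each partition has n sub-files. Finally, partition by partition, each of the n sub-files undergoes business-logic data processing such as sorting, and the processed data is compared. Processing a large file by splitting first and sorting afterwards, as provided by this application, improves file reading efficiency and reconciliation accuracy.
With the file data comparison method provided by this application, the obtained reconciliation file is split into equal parts to obtain N split files; the N split files are distributed into M data partitions according to the user identifiers of the transactions associated with the reconciliation file, where each of the M data partitions corresponds to one user identifier and contains m sub-files; the m sub-files in each data partition are cleaned and classified by transaction type and transaction time information, and all the cleaned and classified files are split into equal parts to obtain n files to be sorted; the n files to be sorted in the M data partitions are sorted by transaction time information to obtain n sorted files; and the data of the n sorted files is compared on the basis of a difference comparison algorithm. In other words, this application first splits the reconciliation file, which enables sharded parsing of a large file and speeds up processing, and then sorts the files in each partition, which improves the accuracy of file processing and avoids the failures that are likely when unordered files are processed directly.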
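In some embodiments, step S203, cleaning and classifying the m sub-files in each data partition by transaction type and transaction time information and splitting all the cleaned and classified files into equal parts to obtain n files to be sorted, may be implemented through the following steps:
A11: read each of the m sub-files and traverse the transaction type and transaction time information of each data row in each sub-file.
A12: process all data rows in each sub-file under the cleaning-and-classification condition of having the j-th transaction type and a transaction time within a one-hour window, to obtain all the cleaned and classified files.
The transaction types include the j-th transaction type. Note that the data in the cleaned and classified files is still unordered; in the embodiments of this application, the data is subsequently sorted by transaction time.
In the embodiments of this application, the transaction types include at least subscription and redemption.
For each of the M data partitions, each sub-file is read and every row is traversed; the data is cleaned and classified by transaction type and transaction time range (for example, one hour) and stored in different files. As an example, the data fields in each row of the reconciliation file of this application are separated by "|". The key fields are listed here in the following format: transaction serial number|user account|transaction type|transaction date|transaction time|transaction amount|transaction shares|remarks.
Referring to FIG. 7, when the files in a partition are processed, data of different transaction types is stored in designated files by transaction time period, taking one hour as the period. A transaction type field of 0 denotes subscription and 1 denotes redemption, and non-key fields are replaced with **. FIG. 7 shows the files obtained by cleaning and classifying the files in such a partition, which include, for example: hour-09 subscription transaction data, that is, transactions whose type is subscription and whose time falls in hour 09; hour-09 redemption transaction data, that is, transactions whose type is redemption and whose time falls in hour 09; and hour-10 redemption transaction data, that is, transactions whose type is redemption and whose time falls in hour 10.
A13: split all the cleaned and classified files into equal parts to obtain W split files.
The W split files include w files to be sorted that have the j-th transaction type and a transaction time within the same one-hour window; the w files for all transaction types together make up the n files to be sorted.
After data cleaning and classification, the file data in each partition is collected into different files by transaction type and hour range. Some hours may carry a large volume of transactions; following the file-splitting principle described earlier, once a file reaches 10 MB, further data is written to a second file. A given transaction type and hour may therefore have w files, that is, many transaction files.
In some embodiments, step S204, sorting the n files to be sorted in the M data partitions by transaction time information to obtain n sorted files, may be implemented through the following steps:
A21: number each of the w files to obtain files numbered 1 to w.
Take the w files of a given transaction type and hour that need to be sorted, each 10 MB for example, and number them 1 to w. As shown in FIG. 8, the subscription transactions of hour 09 correspond to w files; numbering each of them yields files numbered 1 to w: hour-09 subscription transaction data file 1, hour-09 subscription transaction data file 2, hour-09 subscription transaction data file 3, ..., hour-09 subscription transaction data file w. At this point the data in all of these files is unordered. In this application, files are named according to the pattern "{transaction type}_{transaction hour}_{file number}". For example, a subscription file for hour 09 is named 0_09_000001 and a redemption file for hour 09 is named 1_09_000001; the file number is an incrementing six-digit integer.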
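The classification by transaction type and hour, together with the file naming pattern, can be sketched as follows. The sketch assumes the row layout listed above (transaction type as the third field, transaction time as the fifth) and an HHMMSS time format, as suggested by values such as 090002. The names `classify_rows` and `file_name` are hypothetical.

```python
from collections import defaultdict

TYPE_FIELD = 2  # position of the transaction type in a "|"-separated row
TIME_FIELD = 4  # position of the transaction time (assumed HHMMSS)

def classify_rows(rows):
    """Bucket rows by (transaction type, hour); rows inside a bucket stay unordered."""
    buckets = defaultdict(list)
    for row in rows:
        fields = row.split("|")
        hour = fields[TIME_FIELD][:2]
        buckets[(fields[TYPE_FIELD], hour)].append(row)
    return dict(buckets)

def file_name(tx_type, hour, number):
    """Build a name such as 0_09_000001 from type, hour, and a six-digit file number."""
    return f"{tx_type}_{hour}_{number:06d}"
```

For example, `file_name("0", "09", 1)` yields `0_09_000001` for the first hour-09 subscription file.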
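A22: using file memory mapping, for the file numbered i among the files numbered 1 to w, read file blocks of a preset size from that file in parallel each time, to obtain multiple file blocks of equal size for number i.
In the embodiments of this application, during sorting, the differently numbered files among the w files are sorted in parallel. The sorting of the file numbered i is described here using the file numbered 1 as an example; files with other numbers are sorted in the same way. The file numbered 1 is 10 MB in size and is read in 2 MB file blocks, so it is read as five equal blocks.
A23: read file block k among the equal-sized file blocks of number i, and parse each data row of file block k in parallel to obtain the transaction time information of each data row in file block k.
As an example, read file block k among the equal-sized file blocks of the file numbered 1 and parse each data row in it to obtain the transaction time information of each row. Here the first 2 MB of data is read and parsed row by row, and the transaction time is taken as the basis for sorting.
A24: when row i+1 of file block k is read, compare row i+1 with the preceding i rows to determine the target position of row i+1 in file block k, and insert row i+1 at the target position to obtain the sorted file block k.
In the sorted file block k, the transaction time of row i+1 at the target position is later than the transaction time of row i at the adjacent preceding position and earlier than the transaction time of row i+2 at the adjacent following position.
That is, each time a row of file block k is read, it is compared with the rows already read from file block k to find the position at which its time is greater than or equal to the preceding time and less than the following time. The row is inserted at that position, and the rows after it are shifted down by one row. The sorted file block is then written back to the first 2 MB file block of the file numbered 1, which completes the sorting of file block k.
As an example, referring to FIG. 9, file block 1 contains six rows. After the first row is read, it is compared only with the next row; because 090002 is less than 092005 in the next row, the first row stays where it is. After the second row is read, 092005 is greater than 090002; compared with the next row, 092005 is also greater than 090102, which means that the row with 092005 and the row with 090102 should swap positions. After the swap, the third row is read: 092005 is greater than 090102 and less than 094002. Subsequent rows are read and sorted in the same way until the time of every row in file block 1 is greater than or equal to the preceding time and less than the following time, at which point file block 1 is sorted. The sorted file block 1 is shown in FIG. 10.
In the embodiments of this application, each of the equal-sized file blocks of the file numbered i is sorted in this way to obtain multiple sorted file blocks; for the file numbered 1, for example, five sorted file blocks are obtained.
File blocks 2, 3, 4, and 5 of the file numbered 1 are sorted by the same method as file block 1. Each sorted file block k is written back to the file numbered i so that the result of the in-block sort is persisted to disk, which completes the sorting of each file block.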
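The in-block sort of step A24 is an insertion sort keyed on transaction time. A minimal sketch, assuming the "|"-separated row layout with the transaction time as the fifth field, is shown below; `sort_block` is a hypothetical name, and the use of `bisect` to locate the insertion point is a convenience chosen here rather than something the application specifies.

```python
import bisect

TIME_FIELD = 4  # position of the transaction time in a "|"-separated row

def sort_block(rows):
    """Insert each row where its time is >= the preceding time and < the following time."""
    sorted_rows = []
    times = []  # transaction times of sorted_rows, kept in the same order
    for row in rows:
        t = row.split("|")[TIME_FIELD]
        # bisect_right returns the position after all times <= t,
        # which is exactly the "greater than or equal to the preceding time,
        # less than the following time" rule.
        position = bisect.bisect_right(times, t)
        times.insert(position, t)
        sorted_rows.insert(position, row)  # later rows shift down by one
    return sorted_rows
```

Rows with equal transaction times keep their original relative order, so the sort is stable.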
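A25: using a multi-row matching sort, sort across all the sorted file blocks of number i to obtain all the sorted file blocks of number i.
In the embodiments of this application, A24 yields file blocks that have each been sorted individually; multi-row matching is then used to sort across the multiple file blocks of number i.
To sort file block 1 and file block 2 against each other, both blocks are read, and a run of m consecutive rows of file block 2 (here m is a row count and may be 1) is matched to the position in file block 1 at which its time is greater than or equal to the preceding time and less than the following time. The m rows are inserted at that position and the rows after them in file block 1 are shifted down, so that the last m rows of file block 1 move down into file block 2. The m rows that move into file block 2 are compared in the reverse direction and placed at their sorted positions in file block 2, which sorts the two file blocks.
As an example, referring to FIG. 11, file block 1 and file block 2 are read, and two consecutive rows of file block 2 are matched to the position in file block 1 at which their time is greater than or equal to the preceding time and less than the following time, namely the position of rows 4 and 5 of file block 1. The two rows are inserted at that position and the following rows of file block 1 are shifted down, so that rows 7 and 8 of file block 1 after the shift move down into file block 2. These two rows are compared in the reverse direction to find the positions in file block 2 at which their time is greater than or equal to the preceding time and less than the following time; for example, shifted row 7 of file block 1 should be inserted at row 4 of file block 2, and shifted row 8 of file block 1 at row 6 of file block 2. Placing them at those sorted positions sorts the two file blocks.
File block 1 and file block 2 are now sorted in memory, and the sorted result is written back to the first and second 2 MB file blocks of the file numbered 1.
Similarly, referring to FIG. 12, file block 3 is compared with file block 1 and file block 2 in turn to find suitable positions, and m rows of file block 3 are moved to the suitable position in file block 1 or file block 2. Correspondingly, the m surplus rows after the insertion move down into file block 2 or file block 3; if they move into file block 2, the m rows left over after sorting there continue down into file block 3. In the end the three file blocks are sorted.
File blocks 4 and 5 are handled the same way: file block 4 is compared with file blocks 1, 2, and 3 and inserted at suitable positions, and file block 5 is compared with file blocks 1, 2, 3, and 4 and inserted at suitable positions. This completes the sorting across the file blocks of the file numbered 1.
A26: using the multi-row matching sort, sort across the files numbered 1 to w to obtain the n sorted files.
In the embodiments of this application, the files numbered 2 to w are sorted in the same way as the file numbered 1. The sorting of the individually numbered files can proceed in parallel, implemented with techniques such as multithreading or a distributed cluster, so that each of the w files is sorted on its own.
Further, the method used to sort the file numbered 1 is extended to sorting across the file blocks of several files. As shown in FIG. 13, p rows of file block 1 of file number 2 are compared with file block 1 of file number 1; if a suitable position exists, the p rows are inserted there, all the file blocks of file number 1 then shift down by p rows, and the displaced rows move into file number 2, where suitable positions for them are found in the reverse direction. Repeating this process sorts all data rows of all file blocks of file number 1 and file number 2.
Further, file number 3 through file number w are handled in the same way, being compared with file number 1 and file number 2, which finally completes the sorting of the hour-09 subscription files.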
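The end state of steps A25 and A26 (individually sorted blocks become globally sorted while each block keeps its size) can be illustrated with a standard k-way merge. Note that this sketch swaps in `heapq.merge` from the Python standard library in place of the application's multi-row matching and row-shifting scheme; it reproduces the result, not the procedure. The name `merge_blocks` is hypothetical.

```python
import heapq

TIME_FIELD = 4  # position of the transaction time in a "|"-separated row

def tx_time(row):
    return row.split("|")[TIME_FIELD]

def merge_blocks(blocks):
    """Merge individually sorted blocks and cut the result back into blocks of the original sizes."""
    sizes = [len(block) for block in blocks]
    merged = list(heapq.merge(*blocks, key=tx_time))
    result, start = [], 0
    for size in sizes:
        result.append(merged[start:start + size])
        start += size
    return result
```

After the merge, every row in block k is no later than any row in block k+1, which is the property the cross-block and cross-file sorting is meant to establish.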
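In some embodiments, step S205, comparing the data of the n sorted files on the basis of the difference comparison algorithm, may be implemented through the following steps:
A31: export database file i from the database according to the first-row transaction time field and the last-row transaction time field of all the sorted file blocks of number i.
All the sorted file blocks of number i and database file i have the same data partition identifier.
In the embodiments of this application, during reconciliation, data is exported from the database by the same rules as the data cleaning described above, and sub-files are exported according to the delimited data type and transaction time range. To export data from the database, first read the time range of a sorted file in the partition; for a single file, this simply means reading the transaction time fields of its first and last rows. Then export the transaction data within that range.
In the embodiments of this application, when files are exported from the database, the transaction type can be obtained from the file name, so the time range can be delimited when transaction records are exported, and the database script can also sort the records directly.
Further, exported files can be named by rule: the name of a database export file is the name of the corresponding reconciliation file prefixed with "db_". For example, if the sorted reconciliation file in the partition is named "0_09_000001", the database export file is named "db_0_09_000001".
In some embodiments, there are two cases for sorting the database export files. In the first case, a partition has only one database, and the file exported in the export step is already sorted by rule. In the second case, a partition uses several databases, so one reconciliation file corresponds to several database export files. As an example, referring to FIG. 14 with three databases, the reconciliation file "0_09_000001" corresponds to the database export files "db1_0_09_000001", "db2_0_09_000001", and "db3_0_09_000001". The data rows of these three files are then sorted and merged into a single file, for which the file sorting algorithm described earlier can be used.
As an example, referring to FIG. 15 and FIG. 16, database file 1 is exported from the database according to the first-row transaction time field 090002 and the last-row transaction time field 095716 of all the file blocks of number 1; following the naming rule above, the exported database file 1 is named db_0_09_000001. Before exporting, the transaction type can also be checked, and the export is stopped if the transaction type does not match.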
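The export of step A31 can be sketched with Python's built-in `sqlite3` module. Everything about the database side is an assumption made for illustration: the application does not name a database product, and the table name `transactions`, its column names, and the helper names `export_rows` and `export_file_name` are hypothetical. Filters on partition and transaction date are omitted for brevity.

```python
import sqlite3

# Hypothetical schema: one TEXT column per field of the reconciliation row.
COLUMNS = "serial_no, account_id, tx_type, tx_date, tx_time, amount, shares, remark"

def export_rows(conn, tx_type, first_time, last_time):
    """Export rows of one transaction type inside a time range, sorted by transaction time."""
    cursor = conn.execute(
        f"SELECT {COLUMNS} FROM transactions "
        "WHERE tx_type = ? AND tx_time BETWEEN ? AND ? "
        "ORDER BY tx_time, serial_no",
        (tx_type, first_time, last_time),
    )
    return ["|".join(str(value) for value in row) for row in cursor]

def export_file_name(recon_name, db_index=None):
    """Prefix the reconciliation file name with db_ (or db1_, db2_, ... for several databases)."""
    prefix = "db" if db_index is None else f"db{db_index}"
    return f"{prefix}_{recon_name}"
```

Letting the query perform the `ORDER BY` corresponds to the remark that the database script can sort the exported records directly.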
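A32: compute a first hash value of all the sorted file blocks of number i on the basis of the difference comparison algorithm.
In the embodiments of this application, the difference comparison algorithm includes, but is not limited to, the message-digest algorithm MD5.
A33: compute a second hash value of database file i on the basis of the difference comparison algorithm.
In the embodiments of this application, a difference comparison algorithm based on a message-digest algorithm is used. For the partitions shown in FIG. 17, the files that correspond one to one within each partition are compared and the files that differ are filtered out. Referring to FIG. 18, a value H is computed for each file with the message-digest algorithm MD5. Taking the two files 0_09_000001 and db_0_09_000001 as an example, if the values of H computed for the reconciliation file 0_09_000001 and the database export file db_0_09_000001 are the same, the two sides reconcile, no processing is needed, and the pair is excluded directly.
If the values of H computed for the reconciliation file 0_09_000001 and the database export file db_0_09_000001 differ, the two sides do not reconcile and processing continues with the next step.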
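The whole-file digest comparison of steps A32 to A34 can be sketched with `hashlib` from the Python standard library. The helper names `md5_of` and `files_reconcile` are hypothetical; the digest is fed incrementally so that a file does not have to fit in memory.

```python
import hashlib

def md5_of(chunks):
    """Return the MD5 hex digest of a sequence of byte chunks, fed incrementally."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()

def files_reconcile(recon_path, db_path):
    """True when the reconciliation file and the database export have the same MD5 value."""
    # Iterating over a binary file yields its lines as bytes.
    with open(recon_path, "rb") as recon, open(db_path, "rb") as db:
        return md5_of(recon) == md5_of(db)
```

A pair for which `files_reconcile` returns `True` can be dropped immediately; only pairs with different digests proceed to block-level matching.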
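A34: if the first hash value differs from the second hash value, determine that all the sorted file blocks of number i differ from database file i.
A35: on the basis of a data matching algorithm, remove from database file i the file blocks that are identical to file blocks among all the sorted file blocks of number i, and filter out a first difference file block in database file i and a second difference file block among all the sorted file blocks of number i.
In the embodiments of this application, after the MD5 values of the files have been compared, files that are completely identical have been eliminated, and what remains are the reconciliation files and database export files that differ. Because these differing files are sorted, most blocks of consecutive rows in them are probably still equal, so file-block data matching can be used to remove the identical file blocks and keep only the differing parts.
A36: determine difference information between the first difference file block in database file i and the second difference file block among all the sorted file blocks of number i, and perform the data comparison on the basis of the difference information.
In some embodiments, A35, removing the identical file blocks on the basis of the data matching algorithm and filtering out the first difference file block in database file i and the second difference file block among all the sorted file blocks of number i, may be implemented through the following steps:
A351: if the number of data rows in all the sorted file blocks of number i differs from that in database file i, remove the first-row data and the last-row data from all the sorted file blocks of number i and from database file i at least once, to obtain row-trimmed sorted file blocks of number i and a row-trimmed file block of database file i.
The row-trimmed sorted file blocks of number i and the row-trimmed file block of database file i cover the same transaction period.
A352: compute a third hash value of the row-trimmed sorted file blocks of number i on the basis of the difference comparison algorithm.
A353: compute a fourth hash value of the row-trimmed file block of database file i on the basis of the difference comparison algorithm.
A354: if the third hash value differs from the fourth hash value, determine that the row-trimmed sorted file blocks of number i differ from the row-trimmed file block of database file i.
A355: on the basis of the data matching algorithm, remove from the row-trimmed file block of database file i the file blocks that are identical to file blocks among the row-trimmed sorted file blocks of number i, and filter out a third difference file block in the row-trimmed file block of database file i and a fourth difference file block among the row-trimmed sorted file blocks of number i.
A356: remove the last-row data of the fourth difference file block of number i and of the third difference file block of database file i at least once, and filter out the first difference file block in database file i and the second difference file block of number i, where the first difference file block in database file i and the second difference file block of number i have the same last-row transaction time.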
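Referring to FIG. 19, FIG. 20, and FIG. 21: 1) Taking the reconciliation file 0_09_000002 and the database export file db_0_09_000002 as an example, first compare the number of rows in the two files. In general there are three possible cases:
In the first case, 0_09_000002 has more rows than db_0_09_000002.
In the second case, 0_09_000002 has the same number of rows as db_0_09_000002.
In the third case, 0_09_000002 has fewer rows than db_0_09_000002.
2) Next, compare the first rows of the two files and remove the first row whose transaction time is the smaller of the two.
3) Then compare the last rows of the two files and remove the last row whose transaction time is the larger of the two.
4) Repeat the removal of first and last rows until the first-row transaction times of the two files are equal, the last-row transaction times are equal, and the two files have the same number of data rows; then compute the sha1 value of each file with the message-digest algorithm sha1 and compare the two values.
5) If the two sha1 values are equal, remove the equal file block from both files, keeping only the rows of the two files that were excluded earlier, and go to step 7).
6) If the two sha1 values are not equal, remove the last row from both files and continue comparing last-row transaction times until the two files again have equal last-row transaction times and equal row counts, then return to step 5).
7) Repeat steps 1) to 6) until no identical file blocks remain.
Further, two associative arrays can be used to store the sha1 values and data rows of the two difference files, respectively.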
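A simplified sketch of steps 2) to 5) follows. It trims leading and trailing rows until both files cover the same transaction-time window and then compares the sha1 values of what remains; the repeated tail removal of step 6) and the outer loop of step 7) are not reproduced. The row layout with the transaction time as the fifth field is taken from the application, while the names `trim_to_common_window`, `sha1_of`, and `blocks_equal` are hypothetical.

```python
import hashlib

TIME_FIELD = 4  # position of the transaction time in a "|"-separated row

def tx_time(row):
    return row.split("|")[TIME_FIELD]

def sha1_of(rows):
    return hashlib.sha1("\n".join(rows).encode("utf-8")).hexdigest()

def trim_to_common_window(recon_rows, db_rows):
    """Drop earlier first rows and later last rows until both sides span the same time window.

    Both inputs must already be sorted by transaction time.
    Returns (recon_core, db_core, recon_trimmed, db_trimmed); the trimmed lists
    hold the excluded rows in the order they were removed.
    """
    recon, db = list(recon_rows), list(db_rows)
    recon_trimmed, db_trimmed = [], []
    while recon and db:
        if tx_time(recon[0]) < tx_time(db[0]):
            recon_trimmed.append(recon.pop(0))
        elif tx_time(db[0]) < tx_time(recon[0]):
            db_trimmed.append(db.pop(0))
        elif tx_time(recon[-1]) > tx_time(db[-1]):
            recon_trimmed.append(recon.pop())
        elif tx_time(db[-1]) > tx_time(recon[-1]):
            db_trimmed.append(db.pop())
        else:
            break  # first-row times match and last-row times match
    return recon, db, recon_trimmed, db_trimmed

def blocks_equal(recon_core, db_core):
    """Equal row counts and equal sha1 values mean the block can be discarded as identical."""
    return len(recon_core) == len(db_core) and sha1_of(recon_core) == sha1_of(db_core)
```

When `blocks_equal` holds for the two cores, only the trimmed rows remain as candidate differences, which is the outcome described in step 5).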
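In some embodiments, A36, determining the difference information between the first difference file block in database file i and the second difference file block among all the sorted file blocks of number i and performing the data comparison on the basis of the difference information, may be implemented through the following steps:
A361: compute a fifth hash value of the first difference file block in database file i on the basis of the difference comparison algorithm, and record a first associative array of database file i with the fifth hash value as the key and the data rows of the first difference file block in database file i as the value.
A362: compute a sixth hash value of the second difference file block of number i on the basis of the difference comparison algorithm, and record a second associative array of number i with the sixth hash value as the key and the data rows of the second difference file block of number i as the value.
A363: within each partition, compare the keys of the first associative array of database file i with the keys of the second associative array of number i, and remove the data rows whose keys are the same in both associative arrays, to obtain a third associative array of database file i and a fourth associative array of number i.
In the third associative array of database file i, the key is the transaction serial number of each data row and the value is the data row; in the fourth associative array of number i, the key is likewise the transaction serial number of each data row and the value is the data row.
As an example, FIG. 22 shows an associative array whose keys are the transaction serial numbers of the data rows and whose values are the data rows.
In the embodiments of this application, the second associative array of number i is denoted associative array A, the first associative array of database file i is denoted associative array B, the fourth associative array of number i is denoted associative array C, and the third associative array of database file i is denoted associative array D.
A364: determine difference information between the third associative array of database file i and the fourth associative array of number i, and perform the data comparison on the basis of the difference information.
In some embodiments, A364 may be implemented through the following steps:
A3641: if the fourth associative array of number i contains a first key that the third associative array of database file i does not contain, determine that the difference information indicates that the reconciliation file has data that the database lacks; determine, from the first key, the corresponding data row in the fourth associative array of number i; and add that data row to the third associative array of database file i.
A3642: if the fourth associative array of number i lacks a second key that the third associative array of database file i contains, determine that the difference information indicates that the database has data that the reconciliation file lacks; determine, from the second key, the corresponding data row in the third associative array of database file i; and delete that data row.
A3643: if a third key exists in both the fourth associative array of number i and the third associative array of database file i, determine that the difference information indicates that the transaction data exists in both the reconciliation file and the database but is inconsistent, and replace the data row corresponding to the third key in the third associative array of database file i with the data row corresponding to the third key in the fourth associative array of number i.
In one possible embodiment, referring to FIG. 23, the reconciliation of this application is further explained in terms of associative arrays A, B, C, and D. Once all files have been processed, each partition is left with only the two difference associative arrays produced by comparing the reconciliation files with the database files. The difference associative array produced from the reconciliation files is labeled associative array A, and the one produced from the database files is labeled associative array B. By this point very little data is left, but there is still a small chance that identical file rows remain.
Within each partition, the two associative arrays are traversed. All keys of associative array A are compared with all keys of associative array B, and the data whose keys are the same in both is excluded. The remaining data in associative arrays A and B is remapped into new associative arrays C and D, with the unique transaction serial number of each data row as the key and the data row as the value.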
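The construction of associative arrays A and B and their remapping into C and D can be sketched with Python dictionaries. The sketch assumes one sha1 value per data row, which is one reading of the description (the text speaks of hashing a difference file block), and it assumes the transaction serial number is the first "|"-separated field. The names `rows_by_hash` and `remap_by_serial` are hypothetical.

```python
import hashlib

SERIAL_FIELD = 0  # position of the transaction serial number in a "|"-separated row

def rows_by_hash(rows):
    """Associative array keyed by the sha1 value of each data row (assumed per-row hashing)."""
    return {hashlib.sha1(row.encode("utf-8")).hexdigest(): row for row in rows}

def remap_by_serial(a, b):
    """Drop rows whose hash key appears in both arrays, then re-key the rest by serial number."""
    shared = a.keys() & b.keys()
    c = {row.split("|")[SERIAL_FIELD]: row for key, row in a.items() if key not in shared}
    d = {row.split("|")[SERIAL_FIELD]: row for key, row in b.items() if key not in shared}
    return c, d
```

Here `a` plays the role of associative array A (reconciliation side) and `b` the role of associative array B (database side); the returned pair corresponds to C and D.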
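All keys of associative array C are compared with all keys of associative array D. There are three cases:
First, the key exists in associative array C but not in associative array D. This means that the reconciliation file has data that the database lacks. The key is used to look up the corresponding data row in associative array C, and that data row is added to associative array D. This adds the new transaction serial number to associative array D together with the transaction data for that serial number.
Second, the key does not exist in associative array C but exists in associative array D. This means that the database has data that the reconciliation file lacks. The key is used to look up the corresponding data row in associative array D, and that data row is deleted. This removes superfluous and incorrect transaction information from the database.
Third, the key exists in both associative array C and associative array D. This means that the data exists in both the reconciliation file and the database, but the transaction data is inconsistent (consistent data has already been removed by the preceding algorithm). The key is used to look up the corresponding data row in associative array C, and the data row for that key in associative array D is replaced with the data row from associative array C. This ensures that the data in the database is consistent with the data in the reconciliation file.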
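The three cases can be sketched as a single function over two dictionaries keyed by transaction serial number. The function name `reconcile` and the action labels are hypothetical; the sketch returns an updated copy of D rather than writing to a database.

```python
def reconcile(c, d):
    """Apply the three cases to a copy of D and report what was done.

    c: associative array C (reconciliation file side), serial number -> data row
    d: associative array D (database side), serial number -> data row
    Returns (updated_d, actions), where actions is a list of (action, serial number).
    """
    updated = dict(d)
    actions = []
    # Case 1: key in C only. The database lacks the row, so add it.
    for key in sorted(c.keys() - d.keys()):
        updated[key] = c[key]
        actions.append(("add", key))
    # Case 2: key in D only. The database has a surplus row, so delete it.
    for key in sorted(d.keys() - c.keys()):
        del updated[key]
        actions.append(("delete", key))
    # Case 3: key in both but rows differ. The reconciliation file wins.
    for key in sorted(c.keys() & d.keys()):
        if updated[key] != c[key]:
            updated[key] = c[key]
            actions.append(("replace", key))
    return updated, actions
```

After the call, the updated copy of D holds exactly the rows of C, which reflects the goal that the database ends up consistent with the reconciliation file.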
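As can be seen from the above, the file and data difference comparison algorithm of this application relies on the sorting in the large-file sharded parsing algorithm and on the partition design, which makes it easy to use the parallel computing of a distributed system and gives relatively efficient file-block data matching. File processing is also highly accurate, which greatly reduces the probability of file processing failures.
The exemplary structure of the file data comparison apparatus 154 provided by the embodiments of this application, implemented as software modules, is described next. In some embodiments, as shown in FIG. 2, the software modules of the file data comparison apparatus 154 stored in the memory 150 may be the file data comparison apparatus in the server 100 and include:
a processing module 1541, configured to split the obtained reconciliation file into equal parts to obtain N split files;
the processing module 1541, configured to distribute the N split files into M data partitions according to the user identifiers of the transactions associated with the reconciliation file, where each of the M data partitions corresponds to one user identifier and contains m sub-files;
the processing module 1541, configured to clean and classify the m sub-files in each data partition by transaction type and transaction time information, and to split all the cleaned and classified files into equal parts to obtain n files to be sorted;
the processing module 1541, configured to sort the n files to be sorted in the M data partitions by transaction time information to obtain n sorted files;
a reconciliation module 1542, configured to compare the data of the n sorted files on the basis of a difference comparison algorithm.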
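In some embodiments, the processing module 1541 is configured to: read each of the m sub-files and traverse the transaction type and transaction time information of each data row in each sub-file; process all data rows in each sub-file under the cleaning-and-classification condition of having the j-th transaction type and a transaction time within a one-hour window, to obtain all the cleaned and classified files, where the transaction types include the j-th transaction type; and split all the cleaned and classified files into equal parts to obtain W split files, where the W split files include w files to be sorted that have the j-th transaction type and a transaction time within the same one-hour window, and the w files for all transaction types together make up the n files to be sorted.
In some embodiments, the processing module 1541 is configured to: number each of the w files to obtain files numbered 1 to w; using file memory mapping, for the file numbered i among the files numbered 1 to w, read file blocks of a preset size from that file in parallel each time, to obtain multiple file blocks of equal size for number i; read file block k among the equal-sized file blocks of number i and parse each data row of file block k in parallel to obtain the transaction time information of each data row in file block k; when row i+1 of file block k is read, compare row i+1 with the preceding i rows to determine the target position of row i+1 in file block k, and insert row i+1 at the target position to obtain the sorted file block k, where, in the sorted file block k, the transaction time of row i+1 at the target position is later than the transaction time of row i at the adjacent preceding position and earlier than the transaction time of row i+2 at the adjacent following position; using a multi-row matching sort, sort across all the sorted file blocks of number i to obtain all the sorted file blocks of number i; and, using the multi-row matching sort, sort across the files numbered 1 to w to obtain the n sorted files.
In some embodiments, the reconciliation module 1542 is configured to: export database file i from the database according to the first-row transaction time field and the last-row transaction time field of all the sorted file blocks of number i, where all the sorted file blocks of number i and database file i have the same data partition identifier; compute a first hash value of all the sorted file blocks of number i on the basis of the difference comparison algorithm; compute a second hash value of database file i on the basis of the difference comparison algorithm; if the first hash value differs from the second hash value, determine that all the sorted file blocks of number i differ from database file i; on the basis of a data matching algorithm, remove from database file i the file blocks that are identical to file blocks among all the sorted file blocks of number i, and filter out a first difference file block in database file i and a second difference file block among all the sorted file blocks of number i; and determine difference information between the first difference file block in database file i and the second difference file block among all the sorted file blocks of number i, and perform the data comparison on the basis of the difference information.
In some embodiments, the reconciliation module 1542 is configured to: if the number of data rows in all the sorted file blocks of number i differs from that in database file i, remove the first-row data and the last-row data from all the sorted file blocks of number i and from database file i at least once, to obtain row-trimmed sorted file blocks of number i and a row-trimmed file block of database file i, where the row-trimmed sorted file blocks of number i and the row-trimmed file block of database file i cover the same transaction period; compute a third hash value of the row-trimmed sorted file blocks of number i on the basis of the difference comparison algorithm; compute a fourth hash value of the row-trimmed file block of database file i on the basis of the difference comparison algorithm; if the third hash value differs from the fourth hash value, determine that the row-trimmed sorted file blocks of number i differ from the row-trimmed file block of database file i; on the basis of the data matching algorithm, remove from the row-trimmed file block of database file i the file blocks that are identical to file blocks among the row-trimmed sorted file blocks of number i, and filter out a third difference file block in the row-trimmed file block of database file i and a fourth difference file block among the row-trimmed sorted file blocks of number i; and remove the last-row data of the fourth difference file block of number i and of the third difference file block of database file i at least once, and filter out the first difference file block in database file i and the second difference file block of number i, where the first difference file block in database file i and the second difference file block of number i have the same last-row transaction time.
In some embodiments, the reconciliation module 1542 is configured to: compute a fifth hash value of the first difference file block in database file i on the basis of the difference comparison algorithm, and record a first associative array of database file i with the fifth hash value as the key and the data rows of the first difference file block in database file i as the value; compute a sixth hash value of the second difference file block of number i on the basis of the difference comparison algorithm, and record a second associative array of number i with the sixth hash value as the key and the data rows of the second difference file block of number i as the value; within each partition, compare the keys of the first associative array of database file i with the keys of the second associative array of number i, and remove the data rows whose keys are the same in both associative arrays, to obtain a third associative array of database file i and a fourth associative array of number i, where in both the third associative array of database file i and the fourth associative array of number i the key is the transaction serial number of each data row and the value is the data row; and determine difference information between the third associative array of database file i and the fourth associative array of number i, and perform the data comparison on the basis of the difference information.
In some embodiments, the reconciliation module 1542 is configured to: if the fourth associative array of number i contains a first key that the third associative array of database file i does not contain, determine that the difference information indicates that the reconciliation file has data that the database lacks, determine from the first key the corresponding data row in the fourth associative array of number i, and add that data row to the third associative array of database file i; if the fourth associative array of number i lacks a second key that the third associative array of database file i contains, determine that the difference information indicates that the database has data that the reconciliation file lacks, determine from the second key the corresponding data row in the third associative array of database file i, and delete that data row; and if a third key exists in both the fourth associative array of number i and the third associative array of database file i, determine that the difference information indicates that the transaction data exists in both the reconciliation file and the database but is inconsistent, and replace the data row corresponding to the third key in the third associative array of database file i with the data row corresponding to the third key in the fourth associative array of number i.
With the file data comparison apparatus provided by this application, the obtained reconciliation file is split into equal parts to obtain N split files; the N split files are distributed into M data partitions according to the user identifiers of the transactions associated with the reconciliation file, where each of the M data partitions corresponds to one user identifier and contains m sub-files; the m sub-files in each data partition are cleaned and classified by transaction type and transaction time information, and all the cleaned and classified files are split into equal parts to obtain n files to be sorted; the n files to be sorted in the M data partitions are sorted by transaction time information to obtain n sorted files; and the data of the n sorted files is compared on the basis of a difference comparison algorithm. In other words, this application first splits the reconciliation file, which enables sharded parsing of a large file and speeds up processing, and then sorts the files in each partition, which improves the accuracy of file processing and avoids the failures that are likely when unordered files are processed directly.
Note that the description of the apparatus embodiments of this application is similar to the description of the method embodiments above and has similar beneficial effects, so it is not repeated here. For technical details not disclosed in the apparatus embodiments, refer to the description of the method embodiments of this application.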
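An embodiment of this application provides a storage medium storing executable instructions that, when executed by a processor, cause the processor to perform the method provided by the embodiments of this application.
In some embodiments, the storage medium may be a computer-readable storage medium, for example a memory such as a ferroelectric random access memory (FRAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM); it may also be any device that includes one of the above memories or any combination of them.
In some embodiments, the executable instructions may take the form of a program, software, a software module, a script, or code, may be written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
As an example, the executable instructions may, but need not, correspond to a file in a file system. They may be stored as part of a file that holds other programs or data, for example in one or more scripts in a Hypertext Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (for example, files that store one or more modules, subprograms, or code sections). As an example, the executable instructions may be deployed to execute on one computing device, on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.
The above describes merely embodiments of this application and is not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and scope of this application falls within the scope of protection of this application.
The embodiments of this application provide a file data comparison method, apparatus, device, and storage medium. The obtained reconciliation file is split into equal parts to obtain N split files; the N split files are distributed into M data partitions according to the user identifiers of the transactions associated with the reconciliation file, where each of the M data partitions corresponds to one user identifier and contains m sub-files; the m sub-files in each data partition are cleaned and classified by transaction type and transaction time information, and all the cleaned and classified files are split into equal parts to obtain n files to be sorted; the n files to be sorted in the M data partitions are sorted by transaction time information to obtain n sorted files; and the data of the n sorted files is compared on the basis of a difference comparison algorithm. In other words, this application first splits the reconciliation file, which enables sharded parsing of a large file and speeds up processing, and then sorts the files in each partition, which improves the accuracy of file processing and avoids the failures that are likely when unordered files are processed directly.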
Claims (10)
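- A file data comparison method, comprising: splitting an obtained reconciliation file into equal parts to obtain N split files; distributing the N split files into M data partitions according to user identifiers of the transactions associated with the reconciliation file, wherein each of the M data partitions corresponds to one user identifier and each data partition contains m sub-files; cleaning and classifying the m sub-files in each data partition by transaction type and transaction time information, and splitting all the cleaned and classified files into equal parts to obtain n files to be sorted; sorting the n files to be sorted in the M data partitions by the transaction time information to obtain n sorted files; and comparing data of the n sorted files on the basis of a difference comparison algorithm.
- The method according to claim 1, wherein the cleaning and classifying of the m sub-files in each data partition by transaction type and transaction time information, and the splitting of all the cleaned and classified files into equal parts to obtain n files to be sorted, comprise: reading each of the m sub-files and traversing the transaction type and the transaction time information of each data row in each sub-file; processing all data rows in each sub-file under a cleaning-and-classification condition of having a j-th transaction type and a transaction time within a one-hour window, to obtain all the cleaned and classified files, wherein the transaction types include the j-th transaction type; and splitting all the cleaned and classified files into equal parts to obtain W split files, wherein the W split files include w files to be sorted that have the j-th transaction type and a transaction time within the same one-hour window, and the w files for all transaction types together make up the n files to be sorted.
- The method according to claim 2, wherein the sorting of the n files to be sorted in the M data partitions by the transaction time information to obtain n sorted files comprises: numbering each of the w files to obtain files numbered 1 to w; using file memory mapping, for the file numbered i among the files numbered 1 to w, reading file blocks of a preset size from the file numbered i in parallel each time, to obtain multiple file blocks of equal size for number i; reading file block k among the equal-sized file blocks of number i and parsing each data row of file block k in parallel to obtain the transaction time information of each data row in file block k; when row i+1 of file block k is read, comparing row i+1 with the preceding i rows to determine a target position of row i+1 in file block k, and inserting row i+1 at the target position to obtain the sorted file block k, wherein, in the sorted file block k, the transaction time of row i+1 at the target position is later than the transaction time of row i at the adjacent preceding position and earlier than the transaction time of row i+2 at the adjacent following position; using a multi-row matching sort, sorting across all the sorted file blocks of number i to obtain all the sorted file blocks of number i; and, using the multi-row matching sort, sorting across the files numbered 1 to w to obtain the n sorted files.
- The method according to claim 3, wherein the comparing of data of the n sorted files on the basis of the difference comparison algorithm comprises: exporting database file i from a database according to a first-row transaction time field and a last-row transaction time field of all the sorted file blocks of number i, wherein all the sorted file blocks of number i and the database file i have the same data partition identifier; computing a first hash value of all the sorted file blocks of number i on the basis of the difference comparison algorithm; computing a second hash value of the database file i on the basis of the difference comparison algorithm; if the first hash value differs from the second hash value, determining that all the sorted file blocks of number i differ from the database file i; on the basis of a data matching algorithm, removing from the database file i the file blocks that are identical to file blocks among all the sorted file blocks of number i, and filtering out a first difference file block in the database file i and a second difference file block among all the sorted file blocks of number i; and determining difference information between the first difference file block in the database file i and the second difference file block among all the sorted file blocks of number i, and performing the data comparison on the basis of the difference information.
- The method according to claim 4, wherein the removing, on the basis of the data matching algorithm, of the identical file blocks from the database file i and the filtering out of the first difference file block in the database file i and the second difference file block among all the sorted file blocks of number i comprise: if the number of data rows in all the sorted file blocks of number i differs from that in the database file i, removing the first-row data and the last-row data from all the sorted file blocks of number i and from the database file i at least once, to obtain row-trimmed sorted file blocks of number i and a row-trimmed file block of the database file i, wherein the row-trimmed sorted file blocks of number i and the row-trimmed file block of the database file i cover the same transaction period; computing a third hash value of the row-trimmed sorted file blocks of number i on the basis of the difference comparison algorithm; computing a fourth hash value of the row-trimmed file block of the database file i on the basis of the difference comparison algorithm; if the third hash value differs from the fourth hash value, determining that the row-trimmed sorted file blocks of number i differ from the row-trimmed file block of the database file i; on the basis of the data matching algorithm, removing from the row-trimmed file block of the database file i the file blocks that are identical to file blocks among the row-trimmed sorted file blocks of number i, and filtering out a third difference file block in the row-trimmed file block of the database file i and a fourth difference file block among the row-trimmed sorted file blocks of number i; and removing the last-row data of the fourth difference file block of number i and of the third difference file block of the database file i at least once, and filtering out the first difference file block in the database file i and the second difference file block of number i, wherein the first difference file block in the database file i and the second difference file block of number i have the same last-row transaction time.
- The method according to claim 4 or 5, wherein the determining of the difference information between the first difference file block in the database file i and the second difference file block among all the sorted file blocks of number i, and the performing of the data comparison on the basis of the difference information, comprise: computing a fifth hash value of the first difference file block in the database file i on the basis of the difference comparison algorithm, and recording a first associative array of the database file i with the fifth hash value as the key and the data rows of the first difference file block in the database file i as the value; computing a sixth hash value of the second difference file block of number i on the basis of the difference comparison algorithm, and recording a second associative array of number i with the sixth hash value as the key and the data rows of the second difference file block of number i as the value; within each partition, comparing the keys of the first associative array of the database file i with the keys of the second associative array of number i, and removing the data rows whose keys are the same in both associative arrays, to obtain a third associative array of the database file i and a fourth associative array of number i, wherein in both the third associative array of the database file i and the fourth associative array of number i the key is the transaction serial number of each data row and the value is the data row; and determining difference information between the third associative array of the database file i and the fourth associative array of number i, and performing the data comparison on the basis of the difference information.
- The method according to claim 6, wherein the determining of the difference information between the third associative array of the database file i and the fourth associative array of number i, and the performing of the data comparison on the basis of the difference information, comprise: if the fourth associative array of number i contains a first key that the third associative array of the database file i does not contain, determining that the difference information indicates that the reconciliation file has data that the database lacks, determining from the first key the corresponding data row in the fourth associative array of number i, and adding that data row to the third associative array of the database file i; if the fourth associative array of number i lacks a second key that the third associative array of the database file i contains, determining that the difference information indicates that the database has data that the reconciliation file lacks, determining from the second key the corresponding data row in the third associative array of the database file i, and deleting that data row; and if a third key exists in both the fourth associative array of number i and the third associative array of the database file i, determining that the difference information indicates that transaction data exists in both the reconciliation file and the database but is inconsistent, and replacing the data row corresponding to the third key in the third associative array of the database file i with the data row corresponding to the third key in the fourth associative array of number i.
- A file data comparison apparatus, comprising: a processing module, configured to split an obtained reconciliation file into equal parts to obtain N split files; the processing module being configured to distribute the N split files into M data partitions according to user identifiers of the transactions associated with the reconciliation file, wherein each of the M data partitions corresponds to one user identifier and each data partition contains m sub-files; the processing module being configured to clean and classify the m sub-files in each data partition by transaction type and transaction time information, and to split all the cleaned and classified files into equal parts to obtain n files to be sorted; the processing module being configured to sort the n files to be sorted in the M data partitions by the transaction time information to obtain n sorted files; and a reconciliation module, configured to compare data of the n sorted files on the basis of a difference comparison algorithm.
- A file data comparison device, comprising: a memory, configured to store executable instructions; and a processor, configured to implement the method according to any one of claims 1 to 7 when executing the executable instructions stored in the memory.
- A computer-readable storage medium storing executable instructions that, when executed by a processor, cause the processor to implement the method according to any one of claims 1 to 7.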
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110724780.4 | 2021-06-29 | ||
CN202110724780.4A CN113342750B (zh) | 2021-06-29 | 2021-06-29 | 一种文件的数据比对方法、装置、设备及存储介质 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023273235A1 true WO2023273235A1 (zh) | 2023-01-05 |
Family
ID=77481343
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/140732 WO2023273235A1 (zh) | 2021-06-29 | 2021-12-23 | 一种文件的数据比对方法、装置、设备及存储介质 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113342750B (zh) |
WO (1) | WO2023273235A1 (zh) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116308850A (zh) * | 2023-05-19 | 2023-06-23 | 深圳市四格互联信息技术有限公司 | 对账方法、对账系统、对账服务器及存储介质 |
CN116702024A (zh) * | 2023-05-16 | 2023-09-05 | 见知数据科技(上海)有限公司 | 流水数据类型识别方法、装置、计算机设备和存储介质 |
CN116910631A (zh) * | 2023-09-14 | 2023-10-20 | 深圳市智慧城市科技发展集团有限公司 | 数组对比方法、装置、电子设备及可读存储介质 |
CN117762873A (zh) * | 2023-12-20 | 2024-03-26 | 中邮消费金融有限公司 | 数据处理方法、装置、设备及存储介质 |
CN118394849A (zh) * | 2024-06-26 | 2024-07-26 | 杭州古珀医疗科技有限公司 | 一种医疗领域中全量数据的差异比对方法和装置 |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113342750B (zh) * | 2021-06-29 | 2023-11-17 | 深圳前海微众银行股份有限公司 | 一种文件的数据比对方法、装置、设备及存储介质 |
CN113837878B (zh) * | 2021-09-07 | 2024-05-03 | 中国银联股份有限公司 | 一种数据比对方法、装置、设备及存储介质 |
CN113656654B (zh) * | 2021-10-19 | 2022-05-10 | 云丁网络技术(北京)有限公司 | 一种用于设备添加的方法、装置及系统 |
CN113886332B (zh) * | 2021-12-09 | 2022-02-08 | 广东睿江云计算股份有限公司 | 一种大文件差异对比方法、装置、计算机设备及存储介质 |
CN114363321A (zh) * | 2021-12-30 | 2022-04-15 | 支付宝(杭州)信息技术有限公司 | 文件传输方法、设备及系统 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10586019B1 (en) * | 2014-01-24 | 2020-03-10 | The Pnc Financial Services Group, Inc. | Automated healthcare cash account reconciliation method |
CN111325617A (zh) * | 2020-01-22 | 2020-06-23 | 北京开科唯识技术有限公司 | 基于文件的对账方法、装置、计算机设备和可读存储介质 |
CN112037003A (zh) * | 2020-09-17 | 2020-12-04 | 中国银行股份有限公司 | 文件对账处理方法及装置 |
CN112613964A (zh) * | 2020-12-25 | 2021-04-06 | 深圳鼎盛电脑科技有限公司 | 一种对账方法、装置、设备及存储介质 |
CN113342750A (zh) * | 2021-06-29 | 2021-09-03 | 深圳前海微众银行股份有限公司 | 一种文件的数据比对方法、装置、设备及存储介质 |
-
2021
- 2021-06-29 CN CN202110724780.4A patent/CN113342750B/zh active Active
- 2021-12-23 WO PCT/CN2021/140732 patent/WO2023273235A1/zh unknown
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116702024A (zh) * | 2023-05-16 | 2023-09-05 | 见知数据科技(上海)有限公司 | 流水数据类型识别方法、装置、计算机设备和存储介质 |
CN116702024B (zh) * | 2023-05-16 | 2024-05-28 | 见知数据科技(上海)有限公司 | 流水数据类型识别方法、装置、计算机设备和存储介质 |
CN116308850A (zh) * | 2023-05-19 | 2023-06-23 | 深圳市四格互联信息技术有限公司 | 对账方法、对账系统、对账服务器及存储介质 |
CN116308850B (zh) * | 2023-05-19 | 2023-09-05 | 深圳市四格互联信息技术有限公司 | 对账方法、对账系统、对账服务器及存储介质 |
CN116910631A (zh) * | 2023-09-14 | 2023-10-20 | 深圳市智慧城市科技发展集团有限公司 | 数组对比方法、装置、电子设备及可读存储介质 |
CN116910631B (zh) * | 2023-09-14 | 2024-01-05 | 深圳市智慧城市科技发展集团有限公司 | 数组对比方法、装置、电子设备及可读存储介质 |
CN117762873A (zh) * | 2023-12-20 | 2024-03-26 | 中邮消费金融有限公司 | 数据处理方法、装置、设备及存储介质 |
CN118394849A (zh) * | 2024-06-26 | 2024-07-26 | 杭州古珀医疗科技有限公司 | 一种医疗领域中全量数据的差异比对方法和装置 |
Also Published As
Publication number | Publication date |
---|---|
CN113342750A (zh) | 2021-09-03 |
CN113342750B (zh) | 2023-11-17 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21948160 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 16/04/2024) |