CN113342750A - File data comparison method, device, equipment and storage medium - Google Patents

Info

Publication number
CN113342750A
Authority
CN
China
Prior art keywords
file
data
files
difference
transaction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110724780.4A
Other languages
Chinese (zh)
Other versions
CN113342750B (en)
Inventor
徐继盛
万磊
李毅
钱进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd
Priority to CN202110724780.4A (CN113342750B)
Publication of CN113342750A
Priority to PCT/CN2021/140732 (WO2023273235A1)
Application granted
Publication of CN113342750B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/04 Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a file data comparison method, a file data comparison device, file data comparison equipment and a computer-readable storage medium. The method comprises the following steps: performing equal-ratio splitting on an obtained reconciliation file to obtain N split files; dividing the N split files into M data partitions according to the user identifier of the transaction associated with the reconciliation file, wherein each of the M data partitions corresponds to one user identifier and comprises m sub-files; performing data cleaning and classification on the m sub-files in each data partition according to transaction type and transaction time information, and performing equal-ratio splitting on all cleaned and classified files to obtain n files to be sorted; sorting the n files to be sorted in the M data partitions according to the transaction time information to obtain n sorted files; and performing data comparison on the n sorted files based on a difference comparison algorithm.

Description

File data comparison method, device, equipment and storage medium
Technical Field
The embodiments of the application relate to the technical field of data processing in financial technology (Fintech), and relate to, but are not limited to, a file data comparison method, device, equipment and computer-readable storage medium.
Background
With the development of computer technology, more and more technologies are being applied in the financial field, and the traditional financial industry is gradually shifting to financial technology (Fintech). However, because of the security and real-time requirements of the financial industry, financial technology also places higher demands on these technologies.
In the field of financial technology, the transaction products of WeBank have a very large number of users and a very large transaction volume. With hundreds of millions of existing users and single-day transactions, checking whether each user's daily transactions have been processed correctly becomes a problem. For example, the change-through product in the WeChat client can initiate money-market fund subscription and redemption transactions in real time; the shares held by the user are processed in real time, and the transaction processing records are persisted to a database. Corresponding reconciliation files are generated every day: an end-of-day subscription reconciliation file for subscription transactions and an end-of-day redemption reconciliation file for redemption transactions. The reconciliation file uses a proprietary protocol format with one transaction recorded per line. Billions of transactions are sent to the WeChat financing system through the reconciliation file; the WeChat financing system must check the content of the reconciliation file against the users' real-time transaction records, and inconsistent data must be resolved with the content of the reconciliation file as the reference.
In the related art, reconciliation follows the steps in FIG. 1. First, the reconciliation file is read directly and the content of each line is parsed. Second, the key fields obtained by parsing are used to match transaction data in the database. Finally, the possible match results are processed. If the reconciliation file has no record of a transaction but the database does, the transaction must be deleted and rolled back. If the reconciliation file has a record of a transaction but the database does not, the transaction must be added and processed. If both the reconciliation file and the database have the record, there are two cases: if the transaction data are inconsistent, the transaction is processed with the reconciliation file as the reference; if the transaction data are consistent, no processing is needed. In this reconciliation process, the related art therefore at least suffers from low processing efficiency and long processing time, because a large file is parsed and processed directly while it is being read.
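The match outcomes described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that both sides have already been loaded into dictionaries keyed by a transaction key; the function name `match_records` and the outcome labels are hypothetical and do not appear in the application.

```python
def match_records(recon: dict, db: dict) -> dict:
    """Classify transaction keys into the match outcomes of the related art.

    recon: records parsed from the reconciliation file (the reference side).
    db:    records read from the database.
    """
    result = {"delete_and_roll_back": [], "add": [], "overwrite": [], "consistent": []}
    for key in sorted(recon.keys() | db.keys()):
        if key not in recon:
            # Present only in the database: delete and roll back.
            result["delete_and_roll_back"].append(key)
        elif key not in db:
            # Present only in the reconciliation file: add and process.
            result["add"].append(key)
        elif recon[key] != db[key]:
            # Present on both sides but different: the file wins.
            result["overwrite"].append(key)
        else:
            # Present on both sides and identical: nothing to do.
            result["consistent"].append(key)
    return result
```

Holding both sides in memory is exactly what becomes impractical for very large reconciliation files, which is the problem the application sets out to address.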
Disclosure of Invention
The embodiments of the application provide a file data comparison method, device, equipment and computer-readable storage medium, to solve at least the problem in the related art that, during reconciliation, a large file is parsed and processed directly while it is being read, resulting in low processing efficiency and long processing time.
The technical solutions of the embodiments of the application are implemented as follows:
An embodiment of the application provides a file data comparison method, comprising:
performing equal-ratio splitting on the obtained reconciliation file to obtain N split files;
dividing the N split files into M data partitions according to the user identifier of the transaction associated with the reconciliation file, wherein each of the M data partitions corresponds to one user identifier and comprises m sub-files;
performing data cleaning and classification on the m sub-files in each data partition according to transaction type and transaction time information, and performing equal-ratio splitting on all cleaned and classified files to obtain n files to be sorted;
sorting the n files to be sorted in the M data partitions according to the transaction time information to obtain n sorted files;
and performing data comparison on the n sorted files based on a difference comparison algorithm.
An embodiment of the application provides a file data comparison device, comprising:
a processing module, configured to perform equal-ratio splitting on the acquired reconciliation file to obtain N split files;
the processing module being further configured to divide the N split files into M data partitions according to the user identifier of the transaction associated with the reconciliation file, wherein each of the M data partitions corresponds to one user identifier and comprises m sub-files;
the processing module being further configured to perform data cleaning and classification on the m sub-files in each data partition according to transaction type and transaction time information, and to perform equal-ratio splitting on all cleaned and classified files to obtain n files to be sorted;
the processing module being further configured to sort the n files to be sorted in the M data partitions according to the transaction time information to obtain n sorted files;
and a reconciliation module, configured to perform data comparison on the n sorted files based on a difference comparison algorithm.
An embodiment of the present application provides equipment, including:
a memory for storing executable instructions; and a processor for implementing the above method when executing the executable instructions stored in the memory.
An embodiment of the present application provides a computer-readable storage medium storing executable instructions which, when executed, cause a processor to implement the above method.
The embodiment of the application has the following beneficial effects:
The obtained reconciliation file is split in equal ratio to obtain N split files; the N split files are divided into M data partitions according to the user identifier of the transaction associated with the reconciliation file, where each of the M data partitions corresponds to one user identifier and comprises m sub-files; the m sub-files in each data partition are cleaned and classified according to transaction type and transaction time information, and all cleaned and classified files are split in equal ratio to obtain n files to be sorted; the n files to be sorted in the M data partitions are sorted according to the transaction time information to obtain n sorted files; and the n sorted files are compared based on a difference comparison algorithm. In other words, the application first splits the reconciliation file so that a large file is parsed and processed in fragments, which speeds up processing. Furthermore, the files within each partition are sorted, which improves the accuracy of file processing and avoids the failures that are likely when unordered files are processed directly.
Drawings
FIG. 1 is a schematic diagram of a reconciliation flow in the related art;
FIG. 2 is a schematic diagram of an alternative architecture of a server according to an embodiment of the present application;
FIG. 3 is an optional flowchart of a file data comparison method according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of file splitting provided in an embodiment of the present application;
FIG. 5 is a diagram illustrating a result of file splitting provided by an embodiment of the present application;
FIG. 6 is a schematic overall flowchart of a file data comparison method according to an embodiment of the present disclosure;
FIG. 7 is a graph illustrating the results of data cleansing provided by an embodiment of the present application;
FIG. 8 is a diagram illustrating the result of file numbering provided by an embodiment of the present application;
FIG. 9 is a flow chart illustrating sorting of data in file blocks according to an embodiment of the present application;
FIG. 10 is a diagram illustrating the result of ordering data within file blocks according to an embodiment of the present application;
FIG. 11 is a diagram illustrating the result of data sorting between two file blocks according to an embodiment of the present application;
FIG. 12 is a diagram illustrating the result of data sorting among three file blocks according to an embodiment of the present application;
FIG. 13 is a schematic diagram of data ordering between two different numbered files provided by an embodiment of the present application;
FIG. 14 is a schematic diagram of a process for exporting a file from a database according to an embodiment of the present application;
FIG. 15 is a diagram illustrating the results of exporting a file from a database provided by an embodiment of the present application;
FIG. 16 is a schematic flow chart of exporting a file from a database according to an embodiment of the present application;
FIG. 17 is a schematic diagram of reconciliation files with database files in different partitions according to an embodiment of the present application;
FIG. 18 is a schematic diagram illustrating a comparison between a reconciliation file and a database file provided by an embodiment of the present application;
FIG. 19 is a diagram illustrating the result of de-duplicating the difference file between the reconciliation file and the database file provided by an embodiment of the present application;
FIG. 20 is a schematic flowchart illustrating a process of de-duplicating a difference file between a reconciliation file and a database file according to an embodiment of the present application;
FIG. 21 is a schematic flowchart of removing duplicate file blocks by calculating sha1 values according to an embodiment of the present application;
FIG. 22 is a schematic diagram of information related to key-value pairs of associated data provided by an embodiment of the present application;
FIG. 23 is a schematic flowchart of reconciliation provided in an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the embodiments of the present application belong. The terminology used in the embodiments of the present application is for the purpose of describing the embodiments of the present application only and is not intended to be limiting of the present application.
An exemplary application of the file data comparison device provided in the embodiments of the present application is described below. The file data comparison device may be implemented as any terminal having an on-screen display function, such as a notebook computer, a tablet computer, a desktop computer, a mobile device (e.g., a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, a portable game device), an intelligent robot, and the like, and may also be implemented as a server. In the following, an exemplary application is described in which the file data comparison device is implemented as a server.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a server 100 according to an embodiment of the present application, where the server 100 shown in fig. 2 includes: at least one processor 110, at least one network interface 120, a user interface 130, and memory 150. The various components in server 100 are coupled together by a bus system 140. It is understood that the bus system 140 is used to enable connected communication between these components. The bus system 140 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 140 in fig. 2.
The Processor 110 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The user interface 130 includes one or more output devices 131, including one or more speakers and/or one or more visual display screens, that enable the presentation of media content. The user interface 130 also includes one or more input devices 132 including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 150 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 150 optionally includes one or more storage devices physically located remotely from processor 110. The memory 150 includes volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 150 described in embodiments herein is intended to comprise any suitable type of memory. In some embodiments, memory 150 is capable of storing data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 151 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a network communication module 152 for reaching other computing devices via one or more (wired or wireless) network interfaces 120, exemplary network interfaces 120 including Bluetooth, Wireless Fidelity (Wi-Fi), Universal Serial Bus (USB), and the like;
an input processing module 153 for detecting one or more user inputs or interactions from one of the one or more input devices 132 and translating the detected inputs or interactions.
In some embodiments, the device provided in the embodiments of the present application may be implemented in software. FIG. 2 shows a file data comparison device 154 stored in the memory 150; the file data comparison device 154 may be the file data comparison device in the server 100, may be software in the form of programs, plug-ins and the like, and includes the following software modules: a processing module 1541 and a reconciliation module 1542. These modules are logical and can therefore be combined arbitrarily or further split depending on the functions implemented. The functions of the respective modules are explained below.
In other embodiments, the apparatus provided in the embodiments of the present Application may be implemented in hardware, and for example, the apparatus provided in the embodiments of the present Application may be a processor in the form of a hardware decoding processor, which is programmed to execute the data comparison method of the file provided in the embodiments of the present Application, for example, the processor in the form of the hardware decoding processor may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field-Programmable Gate arrays (FPGAs), or other electronic components.
The file data comparison method provided by the embodiments of the present application is described below in connection with an exemplary application and implementation of the server 100. Referring to FIG. 3, FIG. 3 is an optional flowchart of the file data comparison method provided in an embodiment of the present application, described below with reference to the steps shown in FIG. 3.
Step S201: perform equal-ratio splitting on the acquired reconciliation file to obtain N split files.
In the embodiment of the application, once the reconciliation file (i.e., the large file) is obtained, it is split into subfiles block by block in equal ratio according to a large-file fragment parsing algorithm, yielding N split files, each of which ends at a line break. Splitting the reconciliation file into subfiles makes it possible to exploit the parallel computing of a distributed system: the subfiles are processed simultaneously, which speeds up processing.
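A minimal Python sketch of splitting at line breaks is given below. It is an illustration under stated assumptions rather than the algorithm of the application: the function name `split_offsets` is hypothetical, the input is assumed to be a seekable binary file object, and each block boundary is simply pushed forward to the end of the line it lands in.

```python
import io


def split_offsets(f, n):
    """Return at most n (start, end) byte ranges that each end just after a line break.

    f is a seekable binary file object; n is the desired number of blocks.
    If the file does not end with a line break, the last range ends at end of file.
    """
    f.seek(0, io.SEEK_END)
    size = f.tell()
    ranges, start = [], 0
    for i in range(1, n + 1):
        if start >= size:
            break
        # Aim for an equal share of the file, then extend to the end of the line.
        target = size if i == n else max(start, size * i // n)
        f.seek(target)
        f.readline()
        end = f.tell()
        ranges.append((start, end))
        start = end
    return ranges
```

Because the function returns byte ranges rather than copying data, each range could be handed to a separate worker that reads only its own block, which fits the parallel processing described above.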
In other embodiments of the present application, referring to FIG. 4, if the reconciliation file is relatively small, for example smaller than 10 MB, it does not need to be split, and the difference comparison algorithm provided by the present application is used directly for data comparison when reconciliation is performed. In general, reconciliation files are large files, for example larger than 10 MB; a large file is split in equal ratio into N subfiles, which then wait for further processing such as data partitioning, i.e., file partitioning by customer dimension, as described below.
Step S202, dividing the N split files into M data partitions according to the user identification of the transaction related to the account checking file.
Each of the M data partitions corresponds to one user identifier, and each data partition comprises m sub-files. Here, the user identifier that the server assigns to the user, for example an account number, is associated with the partition number of a data partition, so that the N split files can be divided into M data partitions by customer dimension.
In the embodiment of the application, when a user registers an account, the server generates a globally unique account identifier (ID), and the account ID contains the number of the partition to which the user belongs. For example, if the 16-digit account ID of a user is 0010000000000001, the first three digits, 001, are the partition number, and the last 13 digits are the auto-increment sequence within that partition. The account is required for every transaction operation of the user.
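The account ID layout in the example above can be expressed as a short Python sketch. The function names `partition_of` and `sequence_of` are hypothetical, and the validation shown is an assumption added for illustration.

```python
def partition_of(account_id: str) -> str:
    """Return the 3-digit partition number at the start of a 16-digit account ID."""
    if len(account_id) != 16 or not account_id.isdigit():
        raise ValueError("expected a 16-digit account ID")
    return account_id[:3]


def sequence_of(account_id: str) -> int:
    """Return the 13-digit auto-increment sequence within the partition."""
    partition_of(account_id)  # reuse the validation above
    return int(account_id[3:])
```

For the account ID 0010000000000001 from the text, `partition_of` yields "001" and `sequence_of` yields 1.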
The system is deployed on servers and partitioned by customer. For example, with 40 existing partitions, when different customers register accounts with WeBank, each is assigned to one of the 40 partitions according to preset rules. Illustratively, FIG. 5 shows split files among the N split files being divided into three of the M data partitions, with partition numbers 001, 002, and 003.
In other words, after the obtained reconciliation file has been split in equal ratio into N split files, each of the N split files is read and parsed line by line, and the data are partitioned according to the system partition to which the user account belongs, producing intermediate fragment subfiles; each partition produces a set of fragment subfiles. Here, the data in the N files are divided into M partitions, and each fragment file under a partition is also stored up to a certain size, 10 MB being taken as an example. It should be noted that the files generated by partitioning in this step are unordered: the data are only split once according to the partition to which the user account belongs. Data routed to a partition are first written to one file; if that file exceeds the configured size, a second file is started, and so on until all the data have been written into the files of the designated partition.
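The routing and size-based rollover described above can be sketched in Python as follows. This is a simplified in-memory illustration, not the implementation of the application: fragments are modelled as lists of lines instead of files on disk, the function name `partition_lines` is hypothetical, the user account is assumed to be the second "|"-separated field, and a new fragment is started just before the size limit would be exceeded.

```python
from collections import defaultdict


def partition_lines(lines, max_bytes=10 * 1024 * 1024, account_field=1):
    """Group lines by partition number, rolling over to a new fragment by size.

    Returns a dict mapping partition number -> list of fragments,
    where each fragment is a list of lines in arrival (unsorted) order.
    """
    fragments = defaultdict(lambda: [[]])  # partition -> fragments
    sizes = defaultdict(int)               # partition -> bytes in current fragment
    for line in lines:
        partition = line.split("|")[account_field][:3]
        length = len(line.encode("utf-8")) + 1  # +1 for the line break
        if sizes[partition] and sizes[partition] + length > max_bytes:
            fragments[partition].append([])     # start the next fragment
            sizes[partition] = 0
        fragments[partition][-1].append(line)
        sizes[partition] += length
    return dict(fragments)
```

As in the text, no ordering is imposed at this stage; lines stay in the order in which they were read.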
Step S203: perform data cleaning and classification on the m sub-files in each data partition according to transaction type and transaction time information, and perform equal-ratio splitting on all cleaned and classified files to obtain n files to be sorted.
The application takes into account that reconciliation files are generally unordered: processing them directly is likely to fail, so that the reconciliation file has to be parsed and processed two or more times, whereas the transaction types must be processed in a required order. Therefore, in the application, the m sub-files in each of the M data partitions are cleaned and classified according to transaction type and transaction time information, and all cleaned and classified files are split in equal ratio to obtain n files to be sorted. Cleaning and classifying the data by these two factors, transaction type and transaction time, effectively improves sorting efficiency.
Step S204: sort the n files to be sorted in the M data partitions according to the transaction time information to obtain n sorted files.
In the embodiment of the application, once the n files to be sorted in the M data partitions have been obtained by data cleaning and classification, they are sorted with the transaction time information as the sorting dimension. Each of the n sorted files is stored up to a certain size, 2 MB being taken as an example. Sorting the files within each partition in this way improves the accuracy of file processing.
Step S205: perform data comparison on the n sorted files based on the difference comparison algorithm.
In one embodiment, referring to FIG. 6, FIG. 6 shows the overall flow of the file data comparison method of the present application. First, the reconciliation file, i.e., the large file, is split in equal ratio into N subfiles. Then, business-logic processing such as sorting is performed on each of the N subfiles by partition, and the processed data are compared. Processing a large file by splitting first and sorting afterwards, as provided by the application, improves both file reading efficiency and reconciliation accuracy.
In the file data comparison method provided by the application, the obtained reconciliation file is split in equal ratio to obtain N split files; the N split files are divided into M data partitions according to the user identifier of the transaction associated with the reconciliation file, where each of the M data partitions corresponds to one user identifier and comprises m sub-files; the m sub-files in each data partition are cleaned and classified according to transaction type and transaction time information, and all cleaned and classified files are split in equal ratio to obtain n files to be sorted; the n files to be sorted in the M data partitions are sorted according to the transaction time information to obtain n sorted files; and the n sorted files are compared based on a difference comparison algorithm. In other words, the application first splits the reconciliation file so that a large file is parsed and processed in fragments, which speeds up processing. Furthermore, the files within each partition are sorted, which improves the accuracy of file processing and avoids the failures that are likely when unordered files are processed directly.
In some embodiments, step S203, in which the m sub-files in each data partition are cleaned and classified according to transaction type and transaction time information and all cleaned and classified files are split in equal ratio to obtain n files to be sorted, may be implemented by the following steps:
A11: read each of the m sub-files, and traverse the transaction type and transaction time information of each row of data in each sub-file.
A12: process all rows of data in each sub-file under the cleaning and classification condition of the jth transaction type and a one-hour transaction time range, to obtain all cleaned and classified files.
Here, the transaction types include the jth transaction type. It should be noted that the data in all cleaned and classified files are still unordered. In the embodiment of the application, the data are subsequently sorted by transaction time.
In the embodiments of the present application, the transaction types include at least subscription and redemption.
For each of the M data partitions, each sub-file is read and every row is traversed; the data are cleaned and classified by transaction type and transaction time range, for example one hour, and stored in different files. Illustratively, the data fields in each line of the reconciliation file of the present application are separated by "|". Some key fields are listed here, in the following format: transaction serial number | user account | transaction type | transaction date | transaction time | transaction amount | transaction share | remark.
Referring to FIG. 7, for file processing within a partition, data of different transaction types are stored in designated files by transaction time period, for example one hour, where a transaction type field of 0 denotes subscription and 1 denotes redemption, and non-critical fields are replaced with. FIG. 7 shows the classified files obtained after cleaning the files in the partition, including: subscription 09-hour transaction data, i.e., transaction data whose transaction type is subscription and whose transaction time falls in hour 09; redemption 09-hour transaction data, i.e., transaction data whose transaction type is redemption and whose transaction time falls in hour 09; and redemption 10-hour transaction data, i.e., transaction data whose transaction type is redemption and whose transaction time falls in hour 10.
A13, performing equal-ratio splitting on all the cleaned and classified files to obtain W split files.
The W split files comprise W files to be sorted that have the j-th transaction type and a transaction time window of one hour; the W files corresponding to all transaction types form the n files to be sorted.
Here, after data cleaning and classification, the file data within each partition is collected into different files according to transaction type and hour range. The transaction volume in some hours may be large; following the earlier file splitting principle, once a file reaches 10MB, subsequent data is written to a second file. There may therefore be w files, that is, multiple transaction files, for the same transaction type within the same hour.
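The size-based splitting can be illustrated with a small sketch. The function name and the choice to count one newline byte per row are assumptions; the 10MB default is the threshold given in the text.

```python
def split_by_size(lines, max_bytes=10 * 1024 * 1024):
    """Pack rows into consecutive chunks of at most max_bytes each.

    A single row larger than max_bytes still gets a chunk of its own.
    """
    chunks, current, size = [], [], 0
    for line in lines:
        n = len(line.encode("utf-8")) + 1  # +1 for the newline on disk
        if current and size + n > max_bytes:
            chunks.append(current)       # current file is full, start the next
            current, size = [], 0
        current.append(line)
        size += n
    if current:
        chunks.append(current)
    return chunks
```

The number of chunks returned plays the role of w for one transaction type and one hour.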
In some embodiments, step S204, sorting the n files to be sorted in the M data partitions according to the transaction time information to obtain n sorted files, may be implemented by the following steps:
A21, numbering each of the w files to obtain a plurality of files numbered 1 through w.
For the w files of a given transaction type and a given hour, each of which is, for example, 10MB, the files of that hour are numbered from 1 to w. As shown in FIG. 8, the transaction type is purchase and the transaction time is hour 09, which corresponds to w files; numbering each of them yields the files numbered 1 through w: purchase 09-hour transaction data file 1, purchase 09-hour transaction data file 2, purchase 09-hour transaction data file 3, ..., purchase 09-hour transaction data file w. The data in all of these files is unordered at this point. In the present application, files are named in the form "{transaction type}_{transaction period}_{file number}". For example, the first purchase file for hour 09 is named 0_09_000001 and the first redemption file for hour 09 is named 1_09_000001; the file number is a 6-digit integer that is incremented.
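The naming rule can be expressed in a few lines. The helper names below are hypothetical and only illustrate the "{transaction type}_{transaction period}_{file number}" pattern with a zero-padded 6-digit number.

```python
def reconciliation_file_name(txn_type, hour, number):
    """Build a name such as 0_09_000001 (type, 2-digit hour, 6-digit number)."""
    return f"{txn_type}_{hour:02d}_{number:06d}"


def parse_file_name(name):
    """Split a name back into (transaction type, hour, file number)."""
    txn_type, hour, number = name.split("_")
    return txn_type, int(hour), int(number)
```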
A22, for the files numbered 1 through w, reading file blocks of a preset size from the file numbered i in parallel, using file memory mapping, to obtain a plurality of file blocks of the same size for number i.
In the embodiments of the present application, files with different numbers among the w files are sorted in parallel. Sorting of the file numbered i is described here using the file numbered 1 as an example; files with other numbers are sorted in the same way. For the 10MB file numbered 1, a 2MB file block is read each time, so the file is read as 5 equal blocks.
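As a rough sketch of block-wise reading through a memory map, the generator below uses Python's standard `mmap` module. The function name is hypothetical, and the sketch reads blocks sequentially rather than in parallel. It also cuts blocks at fixed byte offsets, so a row may straddle two blocks; how row boundaries are aligned is not specified in the text and is left out here.

```python
import mmap


def read_blocks(path, block_size=2 * 1024 * 1024):
    """Yield consecutive block_size slices of a file through a memory map."""
    with open(path, "rb") as f:
        size = f.seek(0, 2)  # seek to the end to learn the file size
        if size == 0:
            return  # an empty file cannot be memory-mapped
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for offset in range(0, size, block_size):
                yield mm[offset:offset + block_size]
```

With the default 2MB block size, a 10MB file yields 5 blocks, matching the example above.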
A23, reading a file block k among the plurality of equally sized file blocks of number i, and parsing each row of data in file block k in parallel to obtain the transaction time information of each row in file block k.
Illustratively, a file block k among the equally sized file blocks of number 1 is read, and each row in file block k is parsed in parallel to obtain its transaction time information: the first 2MB block is read and parsed row by row, and the transaction time of each row is taken as the basis for sorting.
A24, when the (i+1)-th row of data in file block k is read, comparing it with the row data read before it, determining the target position of the (i+1)-th row in file block k, and inserting the (i+1)-th row at the target position, to obtain the sorted file block k.
In the sorted file block k, the transaction time of the (i+1)-th row at the target position is after the transaction time of the i-th row at the immediately preceding position and before the transaction time of the (i+2)-th row at the immediately following position.
Here, for file block k, each time a row is read it is compared with the rows already read from file block k to find the position where its time is greater than or equal to the preceding time and less than the following time; the row is inserted at that position and the subsequent rows are shifted back by one. The sorted file block is then written back over the first 2MB file block of the file numbered 1, which completes the sorting of file block k.
Illustratively, referring to FIG. 9, file block 1 contains 6 rows of data. After the first row is read, it is compared only with the next row; because 090002 is earlier than 092005 in the next row, the position of the first row is unchanged. After the second row is read, 092005 is later than 090002, and compared with the next row 092005 is also later than 090102, so the row for 092005 is exchanged with the row for 090102. After the exchange, the third row is read: 092005 is later than 090102 and earlier than 094002. Subsequent rows are read and sorted in the same way until the time of every row in file block 1 is greater than or equal to the preceding time and less than the following time. The sorting of file block 1 is then complete, and the sorted file block 1 is shown in FIG. 10.
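The in-block insertion sort can be sketched as below. This is an illustration under assumptions: the transaction time is taken to be the fifth "|"-separated field in HHMMSS form, and the insertion position is located with a binary search (`bisect_right`) instead of the row-by-row comparison described in the text. The resulting position is the same: after every row whose time is less than or equal to the new row's time.

```python
from bisect import bisect_right

TIME_FIELD = 4  # assumed index of the transaction time field


def sort_block(lines):
    """Insertion-sort the rows of one file block by transaction time."""
    sorted_lines, times = [], []
    for line in lines:
        t = line.split("|")[TIME_FIELD]
        # Position where preceding time <= t < following time.
        pos = bisect_right(times, t)
        times.insert(pos, t)
        sorted_lines.insert(pos, line)  # later rows shift back by one
    return sorted_lines
```

Rows with equal transaction times keep their original relative order, because a new row is always placed after existing rows with the same time.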
In the embodiments of the present application, sorting each file block k among the equally sized file blocks of number i yields a plurality of sorted file blocks; for example, 5 sorted file blocks are obtained for the file numbered 1.
File blocks 2, 3, 4 and 5 of the file numbered 1 are sorted by the same method as file block 1; each sorted file block k is written back to the file numbered i, and the in-block sort result is persisted to disk, which completes the sorting within each file block.
A25, sorting all the individually sorted file blocks of number i based on a multi-row matching sorting method, to obtain all sorted file blocks of number i.
In the embodiments of the present application, step A24 yields file blocks that are each sorted internally; multi-row matching is then used to sort across the file blocks, which realizes sorting among the plurality of file blocks of number i.
Here, to sort file block 1 and file block 2 against each other, both file blocks are read. For m consecutive rows of file block 2 (m may be 1), the position in file block 1 is found where their time is greater than or equal to the preceding time and less than the following time, and the m rows are inserted at that position. The rows after that position in file block 1 shift backward, and the last m rows of file block 1 move down into file block 2. The m rows moved into file block 2 are compared in the same way in the reverse direction and placed at their sorted positions in file block 2, so that the two file blocks are sorted.
Illustratively, referring to FIG. 11, file block 1 and file block 2 are read. The position in file block 1 where 2 consecutive rows of file block 2 fit (time greater than or equal to the preceding time and less than the following time) is found, namely rows 4-5 of file block 1, and the 2 rows are inserted there. The subsequent rows of file block 1 shift backward, and rows 7-8 of file block 1 move down into file block 2. These 2 rows are likewise compared in the reverse direction to find the positions in file block 2 where their time is greater than or equal to the preceding time and less than the following time; for example, the moved row 7 of file block 1 should be inserted at row 4 of file block 2, and the moved row 8 at row 6 of file block 2. The rows are placed at these sorted positions in file block 2, so that the two file blocks are sorted.
File block 1 and file block 2 are thus re-sorted in memory, and the sorted results are written back to the first and second 2MB file blocks of the file numbered 1.
Similarly, referring to FIG. 12, file block 3 is compared with file block 1 and file block 2 in turn to find a suitable position, and m rows of file block 3 are moved to that position in file block 1 or file block 2. Correspondingly, the m surplus rows after insertion move down into file block 2 or file block 3; if they move into file block 2, the m rows left over after sorting there continue down into file block 3. The three file blocks are thereby sorted.
File blocks 4 and 5 are processed in the same way: file block 4 is compared with file blocks 1, 2 and 3 to select suitable positions for insertion sorting, and file block 5 is compared with file blocks 1, 2, 3 and 4. This completes the sorting among the file blocks of the file numbered 1.
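The end state of sorting two file blocks against each other can be sketched with a standard two-way merge. Note that this swaps in a different technique: it does not perform the m-row insertion and shift-down procedure described above, it only produces the same result, namely that block 1 keeps its row count and holds the earliest rows while block 2 holds the rest. The function names and the time field index are assumptions.

```python
from heapq import merge

TIME_FIELD = 4  # assumed index of the transaction time field


def txn_time(line):
    return line.split("|")[TIME_FIELD]


def merge_blocks(block1, block2):
    """Re-sort two individually sorted blocks across their boundary.

    Returns (new_block1, new_block2); new_block1 has as many rows as block1.
    """
    merged = list(merge(block1, block2, key=txn_time))
    return merged[:len(block1)], merged[len(block1):]
```

Applying the same function to blocks 1 and 3, then 2 and 3, and so on, extends the idea to the remaining blocks of a file.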
A26, sorting the files numbered 1 through w based on the multi-row matching sorting method, to obtain the n sorted files.
In the embodiments of the present application, the files numbered 2 through w are sorted in the same way as the file numbered 1. The sorting of differently numbered files can proceed in parallel, for example using multithreading or a distributed cluster, so that each of the w files is sorted independently.
Further, the method used to sort the file numbered 1 is extended to sorting across files. As shown in FIG. 13, p rows of data from file block 1 of file number 2 are compared with file block 1 of file number 1; if a suitable position exists, the p rows are inserted there, all file blocks of file number 1 shift down by p rows, and the rows pushed out move into file number 2, where suitable positions for them are found in the reverse direction and they are stored. This process is repeated until all data rows of all file blocks of file number 1 and file number 2 are sorted.
Further, file numbers 3 through w are processed in the same way, by comparison with file numbers 1 and 2, which finally completes the sorting of the files for purchase hour 09.
In some embodiments, step S205, performing data comparison on the n sorted files based on the difference comparison algorithm, may be implemented by the following steps:
A31, exporting database file i from the database according to the first-row transaction time field and the last-row transaction time field of all sorted file blocks of number i.
All sorted file blocks of number i have the same data partition identifier as database file i.
In the embodiments of the present application, during reconciliation, data is exported from the database under the same rule as the data cleaning described above: subfiles are exported for the specified data type and transaction time range. To export the data, first the time range of a sorted file in the partition is read; for a single file, the transaction time fields of its first and last rows are read directly. Second, the transaction data within that range is exported.
In the embodiments of the present application, when a file is exported from the database, the transaction type can be obtained from the file name, so the time range can be bounded when the transaction records are exported; the database script can also sort the records directly.
Further, exported files may be named by rule: the name of a database export file is the corresponding reconciliation file name with the prefix "db_" added. For example, if the sorted reconciliation file in the partition is named "0_09_000001", the database export file is named "db_0_09_000001".
In some embodiments, there are two cases for sorting database export files. In the first case, if a partition has only one database, the database export file has already been sorted by rule when it was exported. In the second case, if a partition uses multiple databases, one reconciliation file corresponds to multiple database export files. For example, referring to FIG. 14 with three databases, the reconciliation file is named "0_09_000001" and the database export files are named "db1_0_09_000001", "db2_0_09_000001" and "db3_0_09_000001". The data rows of the three files are then sorted and merged into one file, using the file sorting algorithm described above.
Illustratively, referring to FIGS. 15 and 16, database file 1 is exported from the database according to the first-row transaction time field 090002 and the last-row transaction time field 095716 of all file blocks of number 1, and the exported database file 1 is named db_0_09_000001. Before the export, the transaction type may also be checked; if the transaction types do not match, the export is stopped.
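The export parameters can be derived from a sorted reconciliation file as sketched below. The function name, the returned dictionary keys, and the field indices are assumptions for illustration; the actual database query is not shown because the text does not specify the schema.

```python
TYPE_FIELD = 2  # assumed index of the transaction type field
TIME_FIELD = 4  # assumed index of the transaction time field


def export_request(recon_name, sorted_lines):
    """Derive the database export parameters for one sorted reconciliation file."""
    txn_type, hour, _number = recon_name.split("_")
    first = sorted_lines[0].split("|")
    last = sorted_lines[-1].split("|")
    if first[TYPE_FIELD] != txn_type:
        # Transaction types do not match: stop the export.
        raise ValueError("transaction type mismatch")
    return {
        "db_file_name": "db_" + recon_name,  # "db_" prefix naming rule
        "txn_type": txn_type,
        "hour": hour,
        "start_time": first[TIME_FIELD],     # first-row transaction time
        "end_time": last[TIME_FIELD],        # last-row transaction time
    }
```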
A32, calculating a first hash value of all sorted file blocks of number i based on the difference comparison algorithm.
In the embodiments of the present application, the difference comparison algorithm includes, but is not limited to, the message digest algorithm md5.
A33, calculating a second hash value of database file i based on the difference comparison algorithm.
In the embodiments of the present application, a difference comparison algorithm based on a message digest algorithm is applied to each partition shown in FIG. 17 to obtain the file comparison differences of that partition and screen out the files that differ. Referring to FIG. 18, a value H is computed for each file with the message digest algorithm md5. Taking the two files 0_09_000001 and db_0_09_000001 as an example, if the H of reconciliation file 0_09_000001 equals the H of database export file db_0_09_000001, the reconciliation is consistent; no further processing is needed and the pair is eliminated directly.
If the H of reconciliation file 0_09_000001 differs from the H of database export file db_0_09_000001, the reconciliation is inconsistent and processing continues with the next step.
A34, if the first hash value is different from the second hash value, determining that all the file blocks with the number i after sorting are different from the database file i.
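The whole-file comparison in steps A32 to A34 can be sketched with Python's standard `hashlib`. The function names and the 1MB read size are assumptions; md5 is used because the text names it as one possible digest.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the md5 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_consistent(recon_path, db_path):
    """True when the reconciliation file and the database export file match."""
    return file_md5(recon_path) == file_md5(db_path)
```

A pair for which `files_consistent` returns True can be eliminated; the remaining pairs go on to block-level matching.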
A35, based on a data matching algorithm, removing from database file i the file blocks that are identical to those in the sorted file blocks of number i, and screening out a first difference file block in database file i and a second difference file block in the sorted file blocks of number i.
Here, after the md5 values of the files are compared, completely consistent files are removed, leaving the files in which the reconciliation file and the database export file differ. Within a differing file, because the files are sorted, most blocks of consecutive rows are likely to be equal; identical file blocks can be removed by file block data matching, leaving only the differing portion.
A36, determining difference information between the first difference file block in database file i and the second difference file block in the sorted file blocks of number i, and performing data comparison based on the difference information.
In some embodiments, step A35, removing from database file i the file blocks identical to those in the sorted file blocks of number i based on the data matching algorithm, and screening out the first difference file block in database file i and the second difference file block in the sorted file blocks of number i, may be implemented by the following steps:
A351, if the number of rows in all sorted file blocks of number i differs from the number of rows in database file i, removing head-row data and tail-row data from the sorted file blocks of number i and from database file i at least once, to obtain the row-removed sorted file blocks of number i and the row-removed file block of database file i.
The row-removed sorted file blocks of number i and the row-removed file block of database file i cover the same transaction time interval.
A352, calculating a third hash value of the row-removed sorted file blocks of number i based on the difference comparison algorithm.
A353, calculating a fourth hash value of the row-removed file block of database file i based on the difference comparison algorithm.
A354, if the third hash value differs from the fourth hash value, determining that the row-removed sorted file blocks of number i differ from the row-removed file block of database file i.
A355, based on the data matching algorithm, removing from the row-removed file block of database file i the file blocks identical to those in the row-removed sorted file blocks of number i, and screening out a third difference file block in the row-removed file block of database file i and a fourth difference file block in the row-removed sorted file blocks of number i.
A356, removing tail-row data from the fourth difference file block of number i and the third difference file block of database file i at least once, and screening out the first difference file block in database file i and the second difference file block of number i; the first difference file block in database file i and the second difference file block of number i have the same last-row transaction time.
Referring to FIGS. 19, 20 and 21:
1) Taking reconciliation file 0_09_000002 and database export file db_0_09_000002 as an example, the row counts of the two files are compared first. There are generally three cases:
In the first case, file 0_09_000002 has more rows than db_0_09_000002.
In the second case, file 0_09_000002 has the same number of rows as db_0_09_000002.
In the third case, file 0_09_000002 has fewer rows than db_0_09_000002.
2) Next, the first rows of the two files are compared, and the first row whose transaction time is earlier is removed.
3) Then the last rows of the two files are compared, and the last row whose transaction time is later is removed.
4) The head-row and tail-row removal is repeated until the first-row transaction times of the two files are equal, the last-row transaction times are equal, and the row counts are equal; the sha1 values of the two files are then computed with the message digest algorithm sha1 and compared.
5) If the two sha1 values are equal, the equal file blocks are removed from both files, leaving only the rows previously excluded from the two files. Go to step 7).
6) If the two sha1 values are not equal, the last row is removed from both files at the same time and the last-row transaction times are compared again; once the last-row transaction times of the two files are equal and the row counts are equal, return to step 5).
7) Repeat steps 1) to 6) until no identical file block remains.
Further, two associative arrays may be used to store the sha1 values and data rows of two difference files, respectively.
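Steps 2) to 4) above can be sketched as follows. This is a partial illustration only: it trims both files to a common transaction time window and computes a sha1 over the remaining rows, but it does not implement the row-count check or the loop in steps 5) to 7). The function names, the field index, and joining rows with a newline before hashing are assumptions.

```python
import hashlib

TIME_FIELD = 4  # assumed index of the transaction time field


def txn_time(line):
    return line.split("|")[TIME_FIELD]


def trim_to_common_window(recon, db):
    """Drop earlier head rows and later tail rows until both ends line up."""
    recon, db = list(recon), list(db)
    excluded_recon, excluded_db = [], []
    # Step 2): remove the first row whose transaction time is earlier.
    while recon and db and txn_time(recon[0]) != txn_time(db[0]):
        if txn_time(recon[0]) < txn_time(db[0]):
            excluded_recon.append(recon.pop(0))
        else:
            excluded_db.append(db.pop(0))
    # Step 3): remove the last row whose transaction time is later.
    while recon and db and txn_time(recon[-1]) != txn_time(db[-1]):
        if txn_time(recon[-1]) > txn_time(db[-1]):
            excluded_recon.append(recon.pop())
        else:
            excluded_db.append(db.pop())
    return recon, db, excluded_recon, excluded_db


def block_sha1(lines):
    """sha1 over the rows of a block, used for the comparison in step 4)."""
    return hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()
```

When the two trimmed blocks have equal sha1 values they can be discarded, and only the excluded rows remain as candidate differences.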
In some embodiments, step A36, determining difference information between the first difference file block in database file i and the second difference file block in the sorted file blocks of number i, and performing data comparison based on the difference information, may be implemented by the following steps:
A361, calculating a fifth hash value of the first difference file block in database file i based on the difference comparison algorithm, and recording a first associative array of database file i with the fifth hash value as the key and the data rows of the first difference file block as the value.
A362, calculating a sixth hash value of the second difference file block of number i based on the difference comparison algorithm, and recording a second associative array of number i with the sixth hash value as the key and the data rows of the second difference file block as the value.
A363, within each partition, comparing the keys of the first associative array of database file i with the keys of the second associative array of number i, and removing the data rows whose keys are the same in both associative arrays, to obtain a third associative array of database file i and a fourth associative array of number i.
In the third associative array of database file i, the key is the transaction serial number of each row of data and the value is the data row; in the fourth associative array of number i, the key is likewise the transaction serial number of each row of data and the value is the data row.
Illustratively, as shown in FIG. 22, an associative array is obtained whose keys are the transaction serial numbers of the rows and whose values are the data rows.
In the embodiments of the present application, the second associative array of number i is denoted associative array A, the first associative array of database file i is denoted associative array B, the fourth associative array of number i is denoted associative array C, and the third associative array of database file i is denoted associative array D.
A364, determining difference information between the third associative array of database file i and the fourth associative array of number i, and performing data comparison based on the difference information.
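Steps A361 to A363 can be sketched with Python dictionaries standing in for the associative arrays. The function names are hypothetical, sha1 is chosen as the digest for the example, and the transaction serial number is assumed to be the first "|"-separated field.

```python
import hashlib


def block_key(rows):
    """Hash of one difference file block, used as the associative array key."""
    return hashlib.sha1("\n".join(rows).encode("utf-8")).hexdigest()


def build_hash_array(blocks):
    """Map block hash -> data rows of that block (arrays A and B)."""
    return {block_key(rows): rows for rows in blocks}


def drop_shared_keys(a, b):
    """Remove entries whose hash key appears in both arrays."""
    shared = a.keys() & b.keys()
    return (
        {k: v for k, v in a.items() if k not in shared},
        {k: v for k, v in b.items() if k not in shared},
    )


def rekey_by_serial(array):
    """Remap the remaining rows to serial number -> data row (arrays C and D)."""
    return {row.split("|")[0]: row for rows in array.values() for row in rows}
```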
In some embodiments, step A364, determining difference information between the third associative array of database file i and the fourth associative array of number i, and performing data comparison based on the difference information, may be implemented by the following steps:
A3641, if the fourth associative array of number i contains a first key that is absent from the third associative array of database file i, determining that the difference information indicates that the reconciliation file contains data absent from the database, determining the data row corresponding to the first key in the fourth associative array of number i, and adding that data row to the third associative array of database file i.
A3642, if the third associative array of database file i contains a second key that is absent from the fourth associative array of number i, determining that the difference information indicates that the database contains data absent from the reconciliation file, determining the data row corresponding to the second key in the third associative array of database file i, and deleting that data row.
A3643, if a third key exists in both the fourth associative array of number i and the third associative array of database file i, determining that the difference information indicates that the transaction data exists in both the reconciliation file and the database but is inconsistent, and replacing the data row corresponding to the third key in the third associative array of database file i with the data row corresponding to the third key in the fourth associative array of number i.
In one practical embodiment, referring to FIG. 23, the reconciliation of the present application is further explained in terms of associative array A, associative array B, associative array C and associative array D. After all files are processed, each partition is left with only two difference associative arrays from comparing the reconciliation file with the database file. The difference associative array generated from the reconciliation file is denoted associative array A, and the difference associative array generated from the database file is denoted associative array B. At this point little data remains, but there is still a small chance that identical file rows exist.
Within each partition, the two associative arrays are traversed. The keys of associative array A are compared with the keys of associative array B, and data with the same key in both associative arrays is excluded. The data remaining in associative arrays A and B is remapped into new associative arrays C and D, with the unique transaction serial number of each row as the key and the data row as the value.
The keys of associative array C are compared with the keys of associative array D. There are three cases:
First, a key exists in associative array C but not in associative array D, which indicates that the reconciliation file contains data that the database lacks. The key is used to look up the corresponding data row in associative array C, and that row is added to associative array D, so that the new transaction serial number and its transaction data are added.
Second, a key does not exist in associative array C but exists in associative array D, which indicates that the database contains data that the reconciliation file lacks. The key is used to look up the corresponding data row in associative array D, and that row is deleted, which removes redundant, incorrect transaction information from the database.
Third, a key exists in both associative array C and associative array D, which indicates that the data exists in both the reconciliation file and the database but the transaction data is inconsistent (consistent data has already been removed by the preceding algorithm). The key is used to look up the corresponding data row in associative array C, and the row for that key in associative array D is replaced by the row from associative array C, which ensures that the data in the database is consistent with the data in the reconciliation file.
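The three cases can be sketched as a function that compares two dictionaries keyed by transaction serial number and lists the corrections to apply on the database side. The function name and the tuple format of the result are assumptions for illustration.

```python
def plan_corrections(recon_rows, db_rows):
    """Compare arrays C (reconciliation) and D (database) key by key.

    Returns a list of (action, serial number, data row) tuples.
    """
    actions = []
    for serial in sorted(recon_rows.keys() | db_rows.keys()):
        if serial not in db_rows:
            # Case 1: only in the reconciliation file, add to the database.
            actions.append(("add", serial, recon_rows[serial]))
        elif serial not in recon_rows:
            # Case 2: only in the database, delete from the database.
            actions.append(("delete", serial, db_rows[serial]))
        elif recon_rows[serial] != db_rows[serial]:
            # Case 3: in both but inconsistent, the reconciliation row wins.
            actions.append(("replace", serial, recon_rows[serial]))
    return actions
```

Keeping the result as a list of planned actions, rather than mutating the database array directly, is a design choice of this sketch so that the corrections can be reviewed or logged before being applied.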
The file and data difference comparison algorithm of the present application adopts the sorting and partition design of the large-file fragment parsing algorithm, which lends itself to parallel computation on a distributed system and gives efficient file block data matching; at the same time, file processing accuracy is high and the probability of file processing failure is greatly reduced.
The following continues with an exemplary structure of the file data comparison device 154 provided in the embodiments of the present application, implemented as software modules. In some embodiments, as shown in FIG. 2, the software modules of the file data comparison device 154 stored in the memory 150 may be a file data comparison device in the server 100, including:
a processing module 1541, configured to perform equal-ratio splitting on the acquired reconciliation file to obtain N split files;
the processing module 1541 is further configured to divide the N split files into M data partitions according to the user identifier of the transaction associated with the reconciliation file; each of the M data partitions corresponds to one user identifier, and each data partition comprises m subfiles;
the processing module 1541 is further configured to perform data cleaning and classification on the m subfiles in each data partition according to the transaction type and the transaction time information, and to perform equal-ratio splitting on all cleaned and classified files to obtain n files to be sorted;
the processing module 1541 is further configured to sort the n files to be sorted in the M data partitions according to the transaction time information to obtain n sorted files;
a reconciliation module 1542, configured to perform data comparison on the n sorted files based on a difference comparison algorithm.
In some embodiments, the processing module 1541 is configured to read each subfile of the m subfiles, and traverse the transaction type and transaction time information of each line of data in each subfile; processing all the line data in each subfile according to cleaning classification conditions with a jth transaction type and one hour of transaction time information to obtain all cleaned and classified files; the transaction types comprise jth transaction types; carrying out equal ratio splitting on all the cleaned and classified files to obtain W split files; the W split files comprise W files to be sorted, wherein the W files are provided with jth transaction types and transaction time information is one hour, and the W files corresponding to all transaction types form n files to be sorted.
In some embodiments, the processing module 1541 is configured to number each file of each w files, to obtain a plurality of files numbered from number 1 to number w; according to a file memory mapping mode, for files with numbers from 1 to w, reading file blocks with preset sizes in the file with the number i in parallel every time to obtain a plurality of file blocks with the same size with the number i; reading a file block k in a plurality of file blocks with the same number i, and analyzing each line of data in the file block k in parallel to obtain transaction time information of each line of data in the file block k; if the (i + 1) th line of data in the file block k is read, comparing the (i + 1) th line of data with the previous (i) th line of data, determining the target position of the (i + 1) th line of data in the file block k, and inserting the (i + 1) th line of data into the target position to obtain a sorted file block k; the transaction time of the (i + 1) th line of data at the target position in the sorted file block k is after the transaction time of the (i) th line of data at the previous adjacent position of the target position and before the transaction time of the (i + 2) th line of data at the next adjacent position of the target position; sequencing all the sequenced file blocks with the number i based on a multi-row matching sequencing mode to obtain all the sequenced file blocks with the number i; and sequencing the files with the numbers from 1 to w based on a sequencing mode of multi-line matching to obtain n sequenced files.
In some embodiments, the reconciliation module 1542 is configured to derive a database file i from the database according to the head line transaction time field and the tail line transaction field of all the file blocks sorted by the number i; all the sequenced file blocks with the number i have the same data partition identification with the database file i; calculating first hash values of all the sorted file blocks with the number i based on a difference comparison algorithm; calculating a second hash value of the database file i based on a difference comparison algorithm; if the first hash value is different from the second hash value, determining that all the file blocks with the serial number i after sequencing are different from the database file i; based on a data matching algorithm, removing the same file blocks in all the sorted file blocks with the number i from the database file i, and screening out first difference file blocks in the database file i and second difference file blocks in all the sorted file blocks with the number i; and determining difference information between a first difference file block in the database file i and a second difference file block in all the sorted file blocks with the serial number i, and comparing data based on the difference information.
In some embodiments, the reconciliation module 1542 is configured to: if the number of rows of all the sorted file blocks of number i differs from the number of rows of data contained in the database file i, remove the head-row data and tail-row data of all the sorted file blocks of number i and of the database file i at least once, to obtain the line-removed sorted file blocks of number i and the line-removed file block of the database file i, where the line-removed sorted file blocks of number i and the line-removed file block of the database file i have the same transaction time interval; calculate a third hash value of the line-removed sorted file blocks of number i based on the difference comparison algorithm; calculate a fourth hash value of the line-removed file block of the database file i based on the difference comparison algorithm; if the third hash value differs from the fourth hash value, determine that the line-removed sorted file blocks of number i differ from the line-removed file block of the database file i; based on the data matching algorithm, remove from the line-removed file block of the database file i the file blocks that are identical to file blocks among the line-removed sorted file blocks of number i, and screen out a third difference file block in the line-removed file block of the database file i and a fourth difference file block among the line-removed sorted file blocks of number i; and remove the tail data of the fourth difference file block of number i and of the third difference file block of the database file i at least once, and screen out the first difference file block in the database file i and the second difference file block of number i, where the first difference file block in the database file i and the second difference file block of number i have the same tail-line transaction time.
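One way to read the head-row and tail-row removal is as trimming both sides to the transaction time interval they share. The Python sketch below follows that reading; it is an interpretation, not the patented procedure. It assumes both inputs are already sorted by time, that timestamps compare correctly as strings, and that rows use the hypothetical layout `serial_no,transaction_time,amount`.

```python
from typing import List, Tuple


def transaction_time(row: str) -> str:
    # Hypothetical row layout: "serial_no,transaction_time,amount"
    return row.split(",")[1]


def align_time_window(file_rows: List[str], db_rows: List[str]) -> Tuple[List[str], List[str]]:
    """Trim head and tail rows so both sides cover the same transaction time interval."""
    if not file_rows or not db_rows:
        return [], []
    # The shared interval starts at the later head row and ends at the earlier tail row.
    start = max(transaction_time(file_rows[0]), transaction_time(db_rows[0]))
    end = min(transaction_time(file_rows[-1]), transaction_time(db_rows[-1]))

    def keep(rows: List[str]) -> List[str]:
        return [row for row in rows if start <= transaction_time(row) <= end]

    return keep(file_rows), keep(db_rows)


file_rows = [
    "t1,2021-06-29 09:59:00,10",
    "t2,2021-06-29 10:01:00,20",
    "t3,2021-06-29 10:03:00,30",
]
db_rows = [
    "t2,2021-06-29 10:01:00,20",
    "t3,2021-06-29 10:03:00,30",
    "t4,2021-06-29 10:06:00,40",
]
trimmed_file, trimmed_db = align_time_window(file_rows, db_rows)
```

After trimming, hashes computed over the two sides are comparable because both cover the same interval.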
In some embodiments, the reconciliation module 1542 is configured to: calculate a fifth hash value of the first difference file block in the database file i based on the difference comparison algorithm, and record a first associative array of the database file i with the fifth hash value as the key and the data row of the first difference file block in the database file i as the value; calculate a sixth hash value of the second difference file block of number i based on the difference comparison algorithm, and record a second associative array of number i with the sixth hash value as the key and the data row of the second difference file block of number i as the value; in each partition, compare the keys of the first associative array of the database file i with the keys of the second associative array of number i, and remove the data rows whose keys are the same in both associative arrays, to obtain a third associative array of the database file i and a fourth associative array of number i, where in both the third associative array and the fourth associative array the key is the transaction serial number of each row of data and the value is the data row; and determine difference information between the third associative array of the database file i and the fourth associative array of number i, and perform data comparison based on the difference information.
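The two-stage use of associative arrays (first keyed by row hash, then keyed by transaction serial number) can be sketched in Python with dictionaries. The hash function (SHA-256), the row layout with the serial number in the first field, and all names below are assumptions for illustration.

```python
import hashlib
from typing import Dict, List, Tuple


def row_hash(row: str) -> str:
    # SHA-256 is an assumed choice; the text only requires a hash value per row.
    return hashlib.sha256(row.encode("utf-8")).hexdigest()


def serial_no(row: str) -> str:
    # Hypothetical row layout: "serial_no,transaction_time,amount"
    return row.split(",")[0]


def residual_arrays(db_rows: List[str], file_rows: List[str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Remove rows whose hashes match on both sides, then re-key the rest by serial number."""
    db_by_hash = {row_hash(row): row for row in db_rows}
    file_by_hash = {row_hash(row): row for row in file_rows}
    shared = db_by_hash.keys() & file_by_hash.keys()
    db_by_serial = {serial_no(r): r for h, r in db_by_hash.items() if h not in shared}
    file_by_serial = {serial_no(r): r for h, r in file_by_hash.items() if h not in shared}
    return db_by_serial, file_by_serial


db_rows = ["t1,10:01,10", "t2,10:03,25", "t4,10:07,40"]
file_rows = ["t1,10:01,10", "t2,10:03,20", "t3,10:05,30"]
db_by_serial, file_by_serial = residual_arrays(db_rows, file_rows)
```

Keying the first pass by hash removes identical rows cheaply; keying the remainder by serial number lets the next step pair up rows that describe the same transaction but differ in content.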
In some embodiments, the reconciliation module 1542 is configured to: if the fourth associative array of number i contains a first key that does not exist in the third associative array of the database file i, determine that the difference information indicates that the reconciliation file contains data absent from the database, determine the data row corresponding to the first key in the fourth associative array of number i, and add that data row to the third associative array of the database file i; if the fourth associative array of number i lacks a second key that exists in the third associative array of the database file i, determine that the difference information indicates that the reconciliation file lacks data present in the database, determine the data row corresponding to the second key in the third associative array of the database file i, and delete that data row; and if the fourth associative array of number i contains a third key that also exists in the third associative array of the database file i, determine that the difference information indicates that both the reconciliation file and the database hold the transaction data but the data are inconsistent, and replace the data row corresponding to the third key in the third associative array of the database file i with the data row corresponding to the third key in the fourth associative array of number i.
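The three difference cases can be expressed as a short Python sketch over two dictionaries keyed by transaction serial number. It assumes the reconciliation file is treated as authoritative, so that in the inconsistent case the file's row overwrites the database row; the function name, report labels, and sample data are hypothetical.

```python
from typing import Dict, List, Tuple


def reconcile(
    db_by_serial: Dict[str, str], file_by_serial: Dict[str, str]
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Classify differences and return the corrected database-side array with a report."""
    result = dict(db_by_serial)
    report: Dict[str, List[str]] = {"missing_in_db": [], "missing_in_file": [], "mismatched": []}
    for key, row in file_by_serial.items():
        if key not in db_by_serial:
            # Case 1: the reconciliation file has a row the database lacks, so add it.
            report["missing_in_db"].append(key)
            result[key] = row
        elif db_by_serial[key] != row:
            # Case 3: both sides have the key but the rows differ, so replace (assumed direction).
            report["mismatched"].append(key)
            result[key] = row
    for key in db_by_serial:
        if key not in file_by_serial:
            # Case 2: the database has a row the reconciliation file lacks, so delete it.
            report["missing_in_file"].append(key)
            del result[key]
    return result, report


db_by_serial = {"t2": "t2,10:03,25", "t4": "t4,10:07,40"}
file_by_serial = {"t2": "t2,10:03,20", "t3": "t3,10:05,30"}
corrected, report = reconcile(db_by_serial, file_by_serial)
```

The function works on a copy, so the original database-side array stays available for auditing.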
With the file data comparison apparatus, equal-ratio splitting is performed on the acquired reconciliation file to obtain N split files; the N split files are divided into M data partitions according to the user identifier of the transaction associated with the reconciliation file, where each of the M data partitions corresponds to one user identifier and comprises m sub-files; the m sub-files in each data partition are cleaned and classified according to the transaction type and the transaction time information, and equal-ratio splitting is performed on all the cleaned and classified files to obtain n files to be sorted; the n files to be sorted in the M data partitions are sorted according to the transaction time information to obtain n sorted files; and data comparison is performed on the n sorted files based on a difference comparison algorithm. In other words, the present application first splits the reconciliation file so that a large file is analyzed and processed in fragments, which speeds up processing; it then sorts the files within each partition, which improves the accuracy of file processing and avoids the processing failures that are likely when unordered files are processed directly.
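The splitting, partitioning, and classification stages summarized above can be sketched in Python as follows. The row layout `user_id,transaction_type,transaction_time,serial_no,amount`, the timestamp format, the grouping of transaction times by clock hour, and both helper names are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, Tuple


def split_evenly(rows: List[str], parts: int) -> List[List[str]]:
    """Split rows into at most `parts` chunks of near-equal size."""
    size = max(1, -(-len(rows) // parts))  # ceiling division, at least one row per chunk
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def partition_and_classify(rows: List[str]) -> Dict[str, Dict[Tuple[str, str], List[str]]]:
    """Group rows by user identifier, then by (transaction type, hour of transaction)."""
    partitions: Dict[str, Dict[Tuple[str, str], List[str]]] = defaultdict(lambda: defaultdict(list))
    for row in rows:
        # Hypothetical layout: "user_id,transaction_type,transaction_time,serial_no,amount"
        user_id, tx_type, tx_time = row.split(",")[:3]
        hour = datetime.strptime(tx_time, "%Y-%m-%d %H:%M:%S").strftime("%Y-%m-%d %H")
        partitions[user_id][(tx_type, hour)].append(row)
    return {user: dict(groups) for user, groups in partitions.items()}


rows = [
    "u1,pay,2021-06-29 10:05:00,t1,10",
    "u1,pay,2021-06-29 10:45:00,t2,20",
    "u1,refund,2021-06-29 11:10:00,t3,30",
    "u2,pay,2021-06-29 10:15:00,t4,40",
]
chunks = split_evenly(rows, 2)
classified = partition_and_classify(rows)
```

Each `(transaction type, hour)` group could then be split again with `split_evenly` and handed to the sorting stage.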
It should be noted that the description of the apparatus in the embodiment of the present application is similar to the description of the method embodiment, and has similar beneficial effects to the method embodiment, and therefore, the description is not repeated. For technical details not disclosed in the embodiments of the apparatus, reference is made to the description of the embodiments of the method of the present application for understanding.
Embodiments of the present application provide a storage medium storing executable instructions, which when executed by a processor, will cause the processor to execute the method provided by the embodiments of the present application.
In some embodiments, the storage medium may be a computer-readable storage medium, such as a Ferroelectric Random Access Memory (FRAM), a Read Only Memory (ROM), a Programmable Read Only Memory (PROM), an Erasable Programmable Read Only Memory (EPROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a flash memory, a magnetic surface memory, an optical disc, or a Compact Disc Read Only Memory (CD-ROM), among other memories; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may, but need not, correspond to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). By way of example, executable instructions may be deployed to be executed on one computing device, or on multiple computing devices at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (10)

1. A file data comparison method, characterized by comprising:
performing equal-ratio splitting on an acquired reconciliation file to obtain N split files;
dividing the N split files into M data partitions according to the user identifier of the transaction associated with the reconciliation file; wherein each of the M data partitions corresponds to one user identifier, and each data partition comprises m sub-files;
according to the transaction type and the transaction time information, performing data cleaning classification on the m sub-files in each data partition, and performing equal ratio splitting on all cleaned and classified files to obtain n files to be sorted;
sorting the n files to be sorted in the M data partitions according to the transaction time information to obtain n sorted files;
and comparing the data of the n sorted files based on a difference comparison algorithm.
2. The method according to claim 1, wherein the cleaning and classifying the m sub-files in each data partition according to transaction type and transaction time information, and performing equal ratio splitting on all cleaned and classified files to obtain n files to be sorted comprises:
reading each subfile in the m subfiles, and traversing the transaction type and the transaction time information of each row of data in each subfile;
processing all rows of data in each sub-file according to a cleaning and classification condition, namely having the j-th transaction type and transaction time information spanning one hour, to obtain all the cleaned and classified files; wherein the transaction types comprise the j-th transaction type;
performing equal-ratio splitting on all the cleaned and classified files to obtain w split files; wherein the w split files are w files to be sorted that have the j-th transaction type and whose transaction time information spans one hour, and the w files corresponding to each of the transaction types together form the n files to be sorted.
3. The method according to claim 2, wherein said sorting the n files to be sorted in the M data partitions according to the transaction time information to obtain n sorted files comprises:
numbering each file in each group of w files to obtain files numbered from 1 to w;
for each file numbered i, from number 1 to number w, reading file blocks of a preset size from the file numbered i in parallel, using file memory mapping, to obtain a plurality of equally sized file blocks of number i;
reading a file block k among the equally sized file blocks of number i, and parsing each line of data in the file block k in parallel to obtain transaction time information of each line of data in the file block k;
when the (i+1)-th line of data in the file block k is read, comparing the (i+1)-th line of data with the preceding i-th line of data, determining a target position of the (i+1)-th line of data in the file block k, and inserting the (i+1)-th line of data at the target position to obtain a sorted file block k; wherein the transaction time of the (i+1)-th line of data at the target position in the sorted file block k is after the transaction time of the i-th line of data at the position immediately preceding the target position and before the transaction time of the (i+2)-th line of data at the position immediately following the target position;
sorting all the sorted file blocks of number i using a multi-row matching sorting mode to obtain all the sorted file blocks of number i;
and sorting the files numbered from 1 to w using the multi-row matching sorting mode to obtain the n sorted files.
4. The method of claim 3, wherein the comparing the n sorted files based on the difference comparison algorithm comprises:
exporting a database file i from a database according to the head-line transaction time field and the tail-line transaction time field of all the sorted file blocks of number i; wherein all the sorted file blocks of number i and the database file i have the same data partition identifier;
calculating first hash values of all the sorted file blocks with the number i based on the difference comparison algorithm;
calculating a second hash value of the database file i based on the difference comparison algorithm;
if the first hash value is different from the second hash value, determining that all the file blocks with the serial number i after sequencing are different from the database file i;
based on a data matching algorithm, removing file blocks which are the same as all the file blocks of the serial number i after sequencing from the database file i, and screening out a first difference file block in the database file i and a second difference file block in all the file blocks of the serial number i after sequencing;
and determining difference information between a first difference file block in the database file i and a second difference file block in all the file blocks with the serial number i after sequencing, and performing data comparison based on the difference information.
5. The method according to claim 4, wherein the screening out a first differential file block in the database file i and a second differential file block in all the sorted file blocks with the number i based on the data matching algorithm by removing the same file block in all the sorted file blocks with the number i from the database file i comprises:
if the number of rows of all the sorted file blocks of number i differs from the number of rows of data contained in the database file i, removing head-row data and tail-row data of all the sorted file blocks of number i and of the database file i at least once, to obtain line-removed sorted file blocks of number i and a line-removed file block of the database file i; wherein the line-removed sorted file blocks of number i and the line-removed file block of the database file i have the same transaction time interval;
calculating a third hash value of the line-removed sorted file blocks of number i based on the difference comparison algorithm;
calculating a fourth hash value of the line-removed file block of the database file i based on the difference comparison algorithm;
if the third hash value differs from the fourth hash value, determining that the line-removed sorted file blocks of number i differ from the line-removed file block of the database file i;
based on the data matching algorithm, removing from the line-removed file block of the database file i the file blocks that are identical to file blocks among the line-removed sorted file blocks of number i, and screening out a third difference file block in the line-removed file block of the database file i and a fourth difference file block among the line-removed sorted file blocks of number i;
removing tail data of the fourth difference file block of number i and of the third difference file block of the database file i at least once, and screening out the first difference file block in the database file i and the second difference file block of number i; wherein the first difference file block in the database file i and the second difference file block of number i have the same tail-line transaction time.
6. The method according to claim 4 or 5, wherein the determining difference information between a first difference file block in the database file i and a second difference file block in all the sorted file blocks with the number i, and performing data comparison based on the difference information comprises:
calculating a fifth hash value of the first difference file block in the database file i based on the difference comparison algorithm, and recording a first associated array of the database file i with the fifth hash value as the key and the data row of the first difference file block in the database file i as the value;
calculating a sixth hash value of the second difference file block of number i based on the difference comparison algorithm, and recording a second associated array of number i with the sixth hash value as the key and the data row of the second difference file block of number i as the value;
in each partition, comparing the keys of the first associated array of the database file i with the keys of the second associated array of the serial number i, and removing the data rows with the same keys in the two associated arrays to obtain a third associated array of the database file i and a fourth associated array of the serial number i; wherein, the key in the third associated array of the database file i is the transaction serial number of each row of data and the value is the data row; the key in the fourth associated array of the number i is the transaction serial number of each row of data and the value is the data row;
and determining difference information between the third associated array of the database file i and the fourth associated array of the number i, and performing data comparison based on the difference information.
7. The method according to claim 6, wherein the determining difference information between the third associated array of the database file i and the fourth associated array of the number i, and performing data comparison based on the difference information comprises:
if a first key which does not exist in the third associated array of the database file i exists in the fourth associated array of the number i, determining that the difference information represents that the reconciliation file has data which does not exist in the database, determining a data row corresponding to the fourth associated array of the number i based on the first key, and adding the data row corresponding to the fourth associated array of the number i in the third associated array of the database file i;
if a second key existing in a third associated array of the database file i does not exist in a fourth associated array of the serial number i, determining that the difference information represents that the reconciliation file does not have data existing in the database, determining a data row corresponding to the third associated array of the database file i based on the second key, and deleting the data row corresponding to the third associated array of the database file i;
if the fourth associated array of number i contains a third key that also exists in the third associated array of the database file i, determining that the difference information indicates that both the reconciliation file and the database hold the transaction data but the transaction data are inconsistent, and replacing the data row corresponding to the third key in the third associated array of the database file i with the data row corresponding to the third key in the fourth associated array of number i.
8. A file data comparison device, characterized by comprising:
the processing module is used for performing equal-ratio splitting on the acquired reconciliation file to obtain N split files;
the processing module is used for dividing the N split files into M data partitions according to the user identifier of the transaction associated with the reconciliation file; wherein each of the M data partitions corresponds to one user identifier, and each data partition comprises m sub-files;
the processing module is used for cleaning and classifying the m sub-files in each data partition according to the transaction type and the transaction time information, and performing equal ratio splitting on all cleaned and classified files to obtain n files to be sorted;
the processing module is used for sequencing the n files to be sequenced in the M data partitions according to the transaction time information to obtain n sequenced files;
and the account checking module is used for comparing the data of the n sorted files based on a difference comparison algorithm.
9. A data comparison device for files, comprising:
a memory for storing executable instructions; a processor for implementing the method of any one of claims 1 to 7 when executing executable instructions stored in the memory.
10. A computer-readable storage medium having stored thereon executable instructions for causing a processor, when executed, to implement the method of any one of claims 1 to 7.
CN202110724780.4A 2021-06-29 2021-06-29 File data comparison method, device, equipment and storage medium Active CN113342750B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110724780.4A CN113342750B (en) 2021-06-29 2021-06-29 File data comparison method, device, equipment and storage medium
PCT/CN2021/140732 WO2023273235A1 (en) 2021-06-29 2021-12-23 Data comparison method, apparatus and device for file, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110724780.4A CN113342750B (en) 2021-06-29 2021-06-29 File data comparison method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113342750A (en) 2021-09-03
CN113342750B (en) 2023-11-17

Family

ID=77481343

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110724780.4A Active CN113342750B (en) 2021-06-29 2021-06-29 File data comparison method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN113342750B (en)
WO (1) WO2023273235A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113656654A (en) * 2021-10-19 2021-11-16 云丁网络技术(北京)有限公司 Method, device and system for adding equipment
CN113837878A (en) * 2021-09-07 2021-12-24 中国银联股份有限公司 Data comparison method, device, equipment and storage medium
CN113886332A (en) * 2021-12-09 2022-01-04 广东睿江云计算股份有限公司 Large file difference comparison method and device, computer equipment and storage medium
CN114363321A (en) * 2021-12-30 2022-04-15 支付宝(杭州)信息技术有限公司 File transmission method, equipment and system
WO2023273235A1 (en) * 2021-06-29 2023-01-05 深圳前海微众银行股份有限公司 Data comparison method, apparatus and device for file, and storage medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116702024A (en) * 2023-05-16 2023-09-05 见知数据科技(上海)有限公司 Method, device, computer equipment and storage medium for identifying type of stream data
CN116308850B (en) * 2023-05-19 2023-09-05 深圳市四格互联信息技术有限公司 Account checking method, account checking system, account checking server and storage medium
CN116910631B (en) * 2023-09-14 2024-01-05 深圳市智慧城市科技发展集团有限公司 Array comparison method, device, electronic equipment and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10586019B1 (en) * 2014-01-24 2020-03-10 The Pnc Financial Services Group, Inc. Automated healthcare cash account reconciliation method
CN111325617A (en) * 2020-01-22 2020-06-23 北京开科唯识技术有限公司 File-based account checking method and device, computer equipment and readable storage medium
CN112037003A (en) * 2020-09-17 2020-12-04 中国银行股份有限公司 File account checking processing method and device
CN112613964A (en) * 2020-12-25 2021-04-06 深圳鼎盛电脑科技有限公司 Account checking method, account checking device, account checking equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113342750B (en) * 2021-06-29 2023-11-17 深圳前海微众银行股份有限公司 File data comparison method, device, equipment and storage medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023273235A1 (en) * 2021-06-29 2023-01-05 深圳前海微众银行股份有限公司 Data comparison method, apparatus and device for file, and storage medium
CN113837878A (en) * 2021-09-07 2021-12-24 中国银联股份有限公司 Data comparison method, device, equipment and storage medium
CN113837878B (en) * 2021-09-07 2024-05-03 中国银联股份有限公司 Data comparison method, device, equipment and storage medium
CN113656654A (en) * 2021-10-19 2021-11-16 云丁网络技术(北京)有限公司 Method, device and system for adding equipment
CN113656654B (en) * 2021-10-19 2022-05-10 云丁网络技术(北京)有限公司 Method, device and system for adding equipment
CN113886332A (en) * 2021-12-09 2022-01-04 广东睿江云计算股份有限公司 Large file difference comparison method and device, computer equipment and storage medium
CN113886332B (en) * 2021-12-09 2022-02-08 广东睿江云计算股份有限公司 Large file difference comparison method and device, computer equipment and storage medium
CN114363321A (en) * 2021-12-30 2022-04-15 支付宝(杭州)信息技术有限公司 File transmission method, equipment and system

Also Published As

Publication number Publication date
WO2023273235A1 (en) 2023-01-05
CN113342750B (en) 2023-11-17

Similar Documents

Publication Publication Date Title
CN113342750B (en) File data comparison method, device, equipment and storage medium
US20200356901A1 (en) Target variable distribution-based acceptance of machine learning test data sets
CN101553813B (en) Managing storage of individually accessible data units
US10402427B2 (en) System and method for analyzing result of clustering massive data
CN104731896A (en) Data processing method and system
CN113590606B (en) Bloom filter-based large data volume secret key duplication eliminating method and system
US10599614B1 (en) Intersection-based dynamic blocking
US20070239663A1 (en) Parallel processing of count distinct values
CN108205571B (en) Key value data table connection method and device
CN108280226B (en) Data processing method and related equipment
CN110019017B (en) High-energy physical file storage method based on access characteristics
KR102425595B1 (en) System for performing searching and analysis based on in-memory computing for real-time data processing, analysis method, and computer program
US20200278980A1 (en) Database processing apparatus, group map file generating method, and recording medium
US11789639B1 (en) Method and apparatus for screening TB-scale incremental data
US11301426B1 (en) Maintaining stable record identifiers in the presence of updated data records
US11308130B1 (en) Constructing ground truth when classifying data
JP2002041551A (en) Compile method for data and storage medium storing the same
WO2004038582A1 (en) Data processing method and data processing program
US11016978B2 (en) Joiner for distributed databases
CN112328630A (en) Data query method, device, equipment and storage medium
CN112307029A (en) Bill data storage and bill generation method, device, server and storage medium
CN106776704A (en) Statistical information collection method and device
US7996366B1 (en) Method and system for identifying stale directories
US11126401B2 (en) Pluggable sorting for distributed databases
CN109542900B (en) Data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant