CN113342750B - File data comparison method, device, equipment and storage medium - Google Patents

Info

Publication number
CN113342750B
CN113342750B CN202110724780.4A CN202110724780A CN113342750B CN 113342750 B CN113342750 B CN 113342750B CN 202110724780 A CN202110724780 A CN 202110724780A CN 113342750 B CN113342750 B CN 113342750B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110724780.4A
Other languages
Chinese (zh)
Other versions
CN113342750A (en
Inventor
徐继盛
万磊
李毅
钱进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Application filed by WeBank Co Ltd
Priority to CN202110724780.4A
Publication of CN113342750A
Priority to PCT/CN2021/140732 (WO2023273235A1)
Application granted
Publication of CN113342750B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/16: File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00: Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/04: Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a file data comparison method, a file data comparison apparatus, a file data comparison device, and a computer-readable storage medium. The method comprises: performing equal-ratio splitting on an obtained reconciliation file to obtain N split files; dividing the N split files into M data partitions according to the user identifiers of the transactions associated with the reconciliation file, where each of the M data partitions corresponds to a user identifier and comprises m subfiles; cleaning and classifying the data of the m subfiles in each data partition according to transaction type and transaction time information, and performing equal-ratio splitting on all cleaned and classified files to obtain n files to be sorted; sorting the n files to be sorted in the M data partitions according to the transaction time information to obtain n sorted files; and comparing the data of the n sorted files based on a difference comparison algorithm.

Description

File data comparison method, device, equipment and storage medium
Technical Field
The embodiments of the application relate to the field of data processing in financial technology (Fintech), and in particular, but not exclusively, to a file data comparison method and apparatus, a file data comparison device, and a computer-readable storage medium.
Background
With the development of computer technology, more and more technologies are being applied in the financial field, and the traditional financial industry is gradually shifting toward financial technology (Fintech). However, the security and real-time requirements of the financial industry also place higher demands on these technologies.
In the Fintech field, the trading products of WeBank have a very large number of users and a very large transaction volume. With hundreds of millions of existing users and daily transactions, checking whether each user's daily transactions were processed correctly is a difficult problem. For example, a "change" product on the WeChat client initiates a money fund redemption transaction in real time, the user's share is processed in real time, and the transaction processing records are persisted in a database. At the end of each day a corresponding reconciliation file is generated, and a final settlement file is generated for the redemption transactions. The reconciliation file uses a special protocol format in which each line records one transaction. Billions of transactions are sent to the WeChat financial system through reconciliation files; the system must check the contents of the reconciliation file against the data it recorded for users in real time, and any inconsistent data must be handled with the reconciliation file taken as authoritative.
In the related art, the reconciliation process is implemented by the steps shown in FIG. 1. First, the reconciliation file is read directly and each line of its content is parsed. Second, several key fields obtained by parsing are used to match transaction data in the database. Finally, the matching results are processed. If a transaction record does not exist in the reconciliation file but exists in the database, the transaction must be deleted and rolled back. If a transaction record exists in the reconciliation file but not in the database, the transaction must be added and processed. If a transaction record exists in both the reconciliation file and the database, there are two cases: if the transaction data is inconsistent, the transaction is processed based on the reconciliation file; if the transaction data is consistent, the reconciliation matches and no processing is needed. Because the large file is parsed and processed directly during reconciliation, the related art suffers from low processing efficiency and long processing time.
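As a minimal sketch of the matching outcomes just described, assuming each transaction has already been looked up by its key fields on both sides and is represented as a dictionary (or None when that side has no record), the decision could be expressed as follows. The function name and the returned labels are hypothetical and are not taken from the related art.

```python
def classify_match(recon_record, db_record):
    """Return the action implied by comparing the two records of one transaction.

    Either argument may be None when that side holds no record.
    The action labels are illustrative names only.
    """
    if recon_record is None and db_record is None:
        raise ValueError("at least one side must hold a record")
    if recon_record is None:
        # Only the database knows the transaction: delete it and roll back.
        return "delete_and_roll_back"
    if db_record is None:
        # Only the reconciliation file knows the transaction: add and process it.
        return "add_and_process"
    if recon_record != db_record:
        # Both sides have it but disagree: the reconciliation file prevails.
        return "process_based_on_reconciliation_file"
    # Both sides agree: nothing to do.
    return "consistent"
```

In this sketch every line of the large file still has to be parsed and matched one by one, which is the cost the application sets out to reduce.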
Disclosure of Invention
The embodiments of the application provide a file data comparison method and apparatus, a file data comparison device, and a computer-readable storage medium, which solve at least the problems of low processing efficiency and long processing time caused in the related art by directly parsing and processing a large file during reconciliation.
The technical scheme of the embodiment of the application is realized as follows:
The embodiment of the application provides a file data comparison method, comprising:
performing equal-ratio splitting on an obtained reconciliation file to obtain N split files;
dividing the N split files into M data partitions according to the user identifiers of the transactions associated with the reconciliation file, where each of the M data partitions corresponds to a user identifier and comprises m subfiles;
cleaning and classifying the data of the m subfiles in each data partition according to transaction type and transaction time information, and performing equal-ratio splitting on all cleaned and classified files to obtain n files to be sorted;
sorting the n files to be sorted in the M data partitions according to the transaction time information to obtain n sorted files;
and comparing the data of the n sorted files based on a difference comparison algorithm.
The embodiment of the application provides a file data comparison apparatus, comprising:
a processing module, configured to perform equal-ratio splitting on an obtained reconciliation file to obtain N split files;
the processing module being further configured to divide the N split files into M data partitions according to the user identifiers of the transactions associated with the reconciliation file, where each of the M data partitions corresponds to a user identifier and comprises m subfiles;
the processing module being further configured to clean and classify the data of the m subfiles in each data partition according to transaction type and transaction time information, and to perform equal-ratio splitting on all cleaned and classified files to obtain n files to be sorted;
the processing module being further configured to sort the n files to be sorted in the M data partitions according to the transaction time information to obtain n sorted files;
and a reconciliation module, configured to compare the data of the n sorted files based on a difference comparison algorithm.
The embodiment of the application provides a device, comprising:
a memory for storing executable instructions; and a processor for implementing the above method when executing the executable instructions stored in the memory.
The embodiment of the application provides a computer-readable storage medium storing executable instructions for causing a processor to perform the above method.
The embodiment of the application has the following beneficial effects:
Equal-ratio splitting is performed on an obtained reconciliation file to obtain N split files; the N split files are divided into M data partitions according to the user identifiers of the transactions associated with the reconciliation file, where each of the M data partitions corresponds to a user identifier and comprises m subfiles; the data of the m subfiles in each data partition is cleaned and classified according to transaction type and transaction time information, and all cleaned and classified files are split in equal ratio to obtain n files to be sorted; the n files to be sorted in the M data partitions are sorted according to the transaction time information to obtain n sorted files; and the data of the n sorted files is compared based on a difference comparison algorithm. In other words, by splitting the reconciliation file, the application parses and processes the large file in fragments, which speeds up processing. Furthermore, sorting the files within the partitions improves the accuracy of file processing and avoids the high probability of failure that comes with processing unordered files directly.
Drawings
FIG. 1 is a schematic diagram of a reconciliation process in the related art;
FIG. 2 is a schematic diagram of an alternative architecture of a server according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of an alternative method for comparing data of files according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of file splitting according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a file splitting result provided by an embodiment of the present application;
FIG. 6 is a schematic overall flow chart of a method for comparing data of files according to an embodiment of the present application;
FIG. 7 is a schematic diagram of the results of data cleansing provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of the result of file numbering provided by an embodiment of the present application;
FIG. 9 is a schematic flow chart of ordering data in a file block according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a result of ordering data within a file block according to an embodiment of the present application;
FIG. 11 is a diagram illustrating the result of ordering data between two file blocks according to an embodiment of the present application;
FIG. 12 is a diagram illustrating the result of ordering data among three file blocks according to an embodiment of the present application;
FIG. 13 is a schematic diagram of data ordering between two differently numbered files provided by an embodiment of the present application;
FIG. 14 is a schematic diagram of a process for exporting files from a database according to an embodiment of the present application;
FIG. 15 is a schematic diagram of the results of exporting files from a database provided by an embodiment of the present application;
FIG. 16 is a flow chart of exporting files from a database according to an embodiment of the present application;
FIG. 17 is a schematic diagram of reconciliation files and database files in different partitions provided by an embodiment of the application;
FIG. 18 is a diagram illustrating a comparison of a reconciliation file with a database file provided by an embodiment of the application;
FIG. 19 is a schematic diagram of the difference files retained after deduplicating a reconciliation file against a database file, provided by an embodiment of the present application;
FIG. 20 is a schematic flow chart of deduplicating a reconciliation file against a database file and retaining the difference files, provided by an embodiment of the present application;
FIG. 21 is a schematic flow chart of deduplicating file blocks by calculating SHA-1 values according to an embodiment of the present application;
FIG. 22 is a schematic diagram of related information of key-value pairs of associated data according to an embodiment of the present application;
FIG. 23 is a schematic diagram of a reconciliation process provided by an embodiment of the present application.
Detailed Description
The present application will be further described in detail with reference to the accompanying drawings, for the purpose of making the objects, technical solutions and advantages of the present application more apparent, and the described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without making any inventive effort are within the scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which embodiments of this application belong. The terminology used in the embodiments of the application is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
The following describes an exemplary application of the data comparison device for files provided in the embodiments of the present application, where the data comparison device for files provided in the embodiments of the present application may be implemented as a notebook computer, a tablet computer, a desktop computer, a mobile device (for example, a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, a portable game device), an intelligent robot, or any terminal having a screen display function, and may also be implemented as a server. In the following, an exemplary application when the data comparison device of the file is implemented as a server will be described.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a server 100 according to an embodiment of the present application, and the server 100 shown in fig. 2 includes: at least one processor 110, at least one network interface 120, a user interface 130, and a memory 150. The various components in server 100 are coupled together by bus system 140. It is understood that the bus system 140 is used to enable connected communications between these components. The bus system 140 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration the various buses are labeled as bus system 140 in fig. 2.
The processor 110 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor (which may be a microprocessor or any conventional processor), a digital signal processor (DSP), another programmable logic device, discrete gate or transistor logic, or discrete hardware components.
The user interface 130 includes one or more output devices 131, including one or more speakers and/or one or more visual displays, that enable presentation of media content. The user interface 130 also includes one or more input devices 132, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 150 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 150 optionally includes one or more storage devices physically located remote from processor 110. Memory 150 includes volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a random access Memory (Random Access Memory, RAM). The memory 150 described in embodiments of the present application is intended to comprise any suitable type of memory. In some embodiments, memory 150 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 151, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, and a driver layer;
A network communication module 152 for reaching other computing devices via one or more (wired or wireless) network interfaces 120, exemplary network interfaces 120 including Bluetooth, Wi-Fi, and Universal Serial Bus (USB);
An input processing module 153 for detecting one or more user inputs or interactions from one of the one or more input devices 132 and translating the detected inputs or interactions.
In some embodiments, the apparatus provided in the embodiments of the present application may be implemented in software, and fig. 2 shows a data comparing apparatus 154 of a file stored in a memory 150, where the data comparing apparatus 154 of the file may be a data comparing apparatus of a file in a server 100, and may be software in the form of a program and a plug-in, and includes the following software modules: processing module 1541, reconciliation module 1542, which are logical, and thus can be arbitrarily combined or further split depending on the functionality implemented. The functions of the respective modules will be described hereinafter.
In other embodiments, the apparatus provided by the embodiments of the present application may be implemented in hardware. By way of example, the apparatus may be a processor in the form of a hardware decoding processor that is programmed to perform the file data comparison method provided by the embodiments of the present application. For example, the processor in the form of a hardware decoding processor may employ one or more application-specific integrated circuits (ASIC), DSPs, programmable logic devices (PLD), complex programmable logic devices (CPLD), field-programmable gate arrays (FPGA), or other electronic components.
The file data comparison method provided by the embodiments of the present application is described below in connection with exemplary applications and implementations of the server 100 provided by the embodiments of the present application. Referring to FIG. 3, FIG. 3 is a schematic flow chart of an alternative file data comparison method provided in an embodiment of the present application, which is described with reference to the steps shown in FIG. 3.
Step S201: perform equal-ratio splitting on the obtained reconciliation file to obtain N split files.
In the embodiment of the application, once the reconciliation file has been acquired, the reconciliation file (that is, the large file) is split into subfiles of equal block size according to a large-file segmentation and parsing algorithm, yielding N split files, each of which ends with a line feed. Splitting the reconciliation file into subfiles makes it possible to exploit the parallel computing of a distributed system to process each subfile, which speeds up processing.
In other embodiments of the present application, referring to FIG. 4, if the reconciliation file is small, for example smaller than 10 MB, it does not need to be split, and reconciliation is performed by applying the difference comparison algorithm provided by the present application directly. In general, the reconciliation file is a large file, for example larger than 10 MB; it is split in equal ratio into N subfiles, which then await the next processing step, such as the file partitioning by customer dimension (that is, data partitioning) described below.
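The splitting rule above, in which blocks of roughly equal size each end at a line feed, can be illustrated with a minimal sketch. It assumes the file content is held in memory as bytes purely for brevity (a real system would presumably stream from disk); the function name and the `block_size` parameter are hypothetical, and a file smaller than one block simply comes back as a single piece.

```python
def split_on_line_boundaries(data, block_size):
    """Split bytes into blocks of about block_size bytes, each ending at a line feed."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    blocks = []
    start = 0
    while start < len(data):
        end = min(start + block_size, len(data))
        if end < len(data):
            # Extend the block to the next line feed so that no record is cut in half.
            newline = data.find(b"\n", end - 1)
            end = len(data) if newline == -1 else newline + 1
        blocks.append(data[start:end])
        start = end
    return blocks
```

With the 10 MB example from the text, `block_size` would be `10 * 1024 * 1024`. Concatenating the returned blocks always reproduces the original content.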
Step S202: divide the N split files into M data partitions according to the user identifiers of the transactions associated with the reconciliation file.
Each of the M data partitions corresponds to a user identifier, and each data partition comprises m subfiles. Here, the user identifier that the server assigns to a user, such as an account number, is associated with the partition number of a data partition, so the N split files can be divided into M data partitions by customer dimension.
In the embodiment of the application, when a user registers an account, the server generates a globally unique account identifier (ID) that contains the number of the partition to which the user belongs. For example, a user's 16-digit account ID is 0010000000000001: the first three digits, 001, are the partition number, and the last 13 digits are an auto-increment sequence within that partition. This account ID is required for every subsequent transaction operation of the user.
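The account ID layout in the example above (a three-digit partition number followed by a 13-digit sequence) can be decoded with a short sketch. The function name is hypothetical, and the validation rules are an assumption based only on the single example given.

```python
def parse_account_id(account_id):
    """Split a 16-digit account ID into (partition number, sequence within the partition)."""
    if len(account_id) != 16 or not (account_id.isascii() and account_id.isdigit()):
        raise ValueError("expected a 16-digit account ID")
    partition_number = account_id[:3]   # first three digits
    sequence = int(account_id[3:])      # last 13 digits, auto-increment sequence
    return partition_number, sequence
```

The partition number is kept as a string so that leading zeros such as those in 001 are preserved.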
The system deployed on the server is partitioned by customer. For example, with 40 partitions, each customer is registered in one of the 40 partitions according to a preset rule when opening a WeBank account. Illustratively, FIG. 5 shows split files among the N split files being divided into three of the M data partitions, with partition numbers 001, 002, and 003.
That is, the application performs equal-ratio splitting on the obtained reconciliation file to obtain N split files, then reads and parses each of the N split files line by line and partitions the data according to the system partition to which each user account belongs, generating intermediate partition subfiles; each partition thus produces a set of partition subfiles. Here, the data in the N files is divided into M partitions, and each file under a partition is also stored up to a certain size, 10 MB being used as an example. It should be noted that the files generated by partitioning in this step are unordered: the data is only split according to the partition to which the user account belongs. After partitioning, the data is written to one file; if that file exceeds the set size, a second file is written, and so on until all data has been written to the files of the designated partition.
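The partitioning step can be sketched as follows, under several stated assumptions: the user account number is the second "|"-separated field (as in the field layout given later in the description), the partition number is the first three digits of the account ID, and a new file is started when adding a line would push the current file past the size limit. The function name and the in-memory representation (lists of lines standing in for files) are hypothetical.

```python
from collections import defaultdict


def partition_lines(lines, max_file_bytes):
    """Group records by partition number, rolling over to a new file when one is full.

    Returns {partition number: [file, file, ...]}, each file being a list of lines
    in their original (unsorted) order.
    """
    partitions = defaultdict(lambda: [[]])  # partition number -> list of files
    sizes = defaultdict(int)                # bytes in the current file of each partition
    for line in lines:
        account_id = line.split("|")[1]     # assumed position of the account number
        partition = account_id[:3]
        line_bytes = len(line.encode("utf-8")) + 1  # +1 for the line feed
        current = partitions[partition][-1]
        if current and sizes[partition] + line_bytes > max_file_bytes:
            # The current file is full: start the next file of this partition.
            current = []
            partitions[partition].append(current)
            sizes[partition] = 0
        current.append(line)
        sizes[partition] += line_bytes
    return dict(partitions)
```

No ordering is applied here, which matches the note above that the partition files produced by this step are unordered.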
Step S203: clean and classify the data of the m subfiles in each data partition according to transaction type and transaction time information, and perform equal-ratio splitting on all cleaned and classified files to obtain n files to be sorted.
The application takes into account that the reconciliation file is generally unordered, so processing it directly is very likely to fail and it would have to be parsed two or more times, while the transaction types must be processed in a required order. The application therefore cleans and classifies the m subfiles in each of the M data partitions according to transaction type and transaction time information, and performs equal-ratio splitting on all cleaned and classified files to obtain n files to be sorted. Cleaning and classifying the data by these two factors, transaction type and transaction time information, effectively improves sorting efficiency.
Step S204: sort the n files to be sorted in the M data partitions according to the transaction time information to obtain n sorted files.
In the embodiment of the application, once the n files to be sorted in the M data partitions have been obtained by data cleaning and classifying, they are sorted using the transaction time information as the sorting dimension. Each sorted file is also stored up to a certain size, 2 MB being used as an example. Sorting the files within the partitions in this way improves the accuracy of file processing.
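One possible way to sort several files by transaction time is sketched below. It sorts each file on its own and then merges the results with Python's `heapq.merge`; this is a stand-in chosen for illustration, not the ordering procedure of the application, which is shown in FIGS. 9 to 13. The sketch assumes that the transaction date and time are the fourth and fifth "|"-separated fields and are fixed-width strings whose lexicographic order is chronological (for example 20210628 and 093000), and it caps output files by record count rather than by the 2 MB example size to stay short. All names are hypothetical.

```python
import heapq


def time_key(line):
    """Sort key: (transaction date, transaction time), assumed to be fields 4 and 5."""
    fields = line.split("|")
    return fields[3], fields[4]


def sort_files(files, max_records):
    """Sort records across several files by transaction time and re-chunk the result."""
    # Sort each file block on its own, then merge the sorted blocks.
    sorted_blocks = [sorted(block, key=time_key) for block in files]
    merged = heapq.merge(*sorted_blocks, key=time_key)
    output, current = [], []
    for line in merged:
        current.append(line)
        if len(current) == max_records:
            output.append(current)
            current = []
    if current:
        output.append(current)
    return output
```

After this step, reading the output files one after another yields the records in ascending transaction time.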
Step S205: compare the data of the n sorted files based on a difference comparison algorithm.
In one possible embodiment, FIG. 6 shows the overall flow of the file data comparison method of the application. First, the reconciliation file (the large file) is split into N subfiles. Next, the N subfiles are divided by user dimension into M data partitions, each containing m subfiles. Then, the m subfiles of each partition are cleaned, partition by partition, to obtain n subfiles per partition. Finally, business-logic data processing such as sorting is applied to each of the n subfiles, partition by partition, and the processed data is then compared. The method provided by the application processes large files by splitting first and sorting afterwards, which improves file reading efficiency and reconciliation accuracy.
In the file data comparison method provided by the embodiments of the application, equal-ratio splitting is performed on an obtained reconciliation file to obtain N split files; the N split files are divided into M data partitions according to the user identifiers of the transactions associated with the reconciliation file, where each of the M data partitions corresponds to a user identifier and comprises m subfiles; the data of the m subfiles in each data partition is cleaned and classified according to transaction type and transaction time information, and all cleaned and classified files are split in equal ratio to obtain n files to be sorted; the n files to be sorted in the M data partitions are sorted according to the transaction time information to obtain n sorted files; and the data of the n sorted files is compared based on a difference comparison algorithm. In other words, by splitting the reconciliation file, the application parses and processes the large file in fragments, which speeds up processing. Furthermore, sorting the files within the partitions improves the accuracy of file processing and avoids the high probability of failure that comes with processing unordered files directly.
In some embodiments, step S203, in which the m subfiles in each data partition are cleaned and classified according to transaction type and transaction time information and all cleaned and classified files are split in equal ratio to obtain n files to be sorted, may be implemented by the following steps:
A11: read each of the m subfiles and traverse the transaction type and transaction time information of each row of data in each subfile.
A12: process all row data in each subfile according to the cleaning and classifying condition that the row has the j-th transaction type and its transaction time falls within a given one-hour period, to obtain all cleaned and classified files.
The transaction types include the j-th transaction type. It should be noted that the data in the cleaned and classified files is still unordered; in the embodiment of the application, the data is subsequently sorted by transaction time.
In embodiments of the present application, the transaction types include at least subscription and redemption.
For each of the M data partitions, each subfile is read and every row is traversed; the data is cleaned and classified by transaction type and transaction time range, for example one hour, and stored in different files. Illustratively, the data fields in each line of the reconciliation file of the present application are separated by "|". Some key fields are listed here in the following format: transaction serial number|user account number|transaction type|transaction date|transaction time|transaction amount|transaction share|remark.
Referring to FIG. 7, when the files in a given partition are processed, data of different transaction types is stored in designated files by transaction time period, one hour being used as an example. In the transaction type field, 0 denotes subscription and 1 denotes redemption, and non-key fields are replaced by a placeholder. FIG. 7 shows the files obtained after cleaning and classifying the files in that partition, which include, for example: subscription transaction data for hour 09, that is, transaction data whose transaction type is subscription and whose transaction time falls in hour 09; redemption transaction data for hour 09, that is, transaction data whose transaction type is redemption and whose transaction time falls in hour 09; and redemption transaction data for hour 10, that is, transaction data whose transaction type is redemption and whose transaction time falls in hour 10.
A13, performing equal-ratio splitting on all the cleaned and classified files to obtain w split files.
The w split files are the w files to be sorted that have the j-th transaction type and a transaction time period of one hour, and the w files corresponding to all the transaction types together form the n files to be sorted.
Here, after data cleaning and classifying, the file data in each partition is collected into different files according to transaction type and hour range. Some hours may have a relatively large volume of transactions; according to the file splitting principle described previously, once a file reaches 10MB the subsequent data is written to a second file, and so on. There may therefore be w files, i.e., many transaction files, for the same transaction type within the same hour.
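The cleaning, classifying and splitting of A11 to A13 can be sketched as follows. This is only a minimal illustration, assuming the "|"-separated field order listed above and a transaction time formatted as HHMMSS (as the times in the figures suggest); the function names `classify_rows` and `split_by_size` and the rule of skipping blank lines are hypothetical and do not come from the application.

```python
from collections import defaultdict

def classify_rows(lines):
    """Group '|'-separated rows by (transaction type, transaction hour)."""
    buckets = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # assumed cleaning rule: skip blank lines
        fields = line.split("|")
        tx_type, tx_time = fields[2], fields[4]  # assumed field positions
        buckets[(tx_type, tx_time[:2])].append(line)
    return dict(buckets)

def split_by_size(lines, max_bytes=10 * 1024 * 1024):
    """Split rows into consecutive chunks of at most max_bytes each."""
    chunks, current, size = [], [], 0
    for line in lines:
        n = len(line.encode("utf-8")) + 1  # +1 for the newline
        if current and size + n > max_bytes:
            chunks.append(current)
            current, size = [], 0
        current.append(line)
        size += n
    if current:
        chunks.append(current)
    return chunks
```

Each bucket returned by `classify_rows` corresponds to one "transaction type and hour" file, and `split_by_size` yields the w files for that bucket.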
In some embodiments, step S204 sorts the n files to be sorted in the M data partitions according to the transaction time information, to obtain n sorted files, which may be implemented by the following steps:
A21, numbering each of the w files to obtain a plurality of files numbered 1 to w.
Take the w files of a certain transaction type for a certain hour, each 10MB in size, as an example; the w files are numbered 1 to w. As shown in fig. 8, for the w files whose transaction type is purchase and whose transaction time falls in hour 09, each file is numbered to obtain files numbered 1 to w, including: purchase 09-hour transaction data file 1, purchase 09-hour transaction data file 2, purchase 09-hour transaction data file 3, ..., purchase 09-hour transaction data file w. The data in all files is unordered at this point. In the present application, files are named using the pattern "{transaction type}_{transaction time period}_{file number}". For example, the first purchase 09-hour file is named 0_09_000001 and the first redemption 09-hour file is named 1_09_000001, where the file number is an incrementing 6-digit integer.
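The naming rule can be illustrated with a short helper. The function name `split_file_name` and its parameters are hypothetical; only the resulting pattern is taken from the examples above.

```python
def split_file_name(tx_type: int, hour: int, number: int) -> str:
    """Build '{transaction type}_{hour}_{file number}' with a 6-digit file number."""
    return f"{tx_type}_{hour:02d}_{number:06d}"
```

For instance, `split_file_name(0, 9, 1)` gives the name of the first purchase 09-hour file.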
A22, according to a file memory mapping mode, for the files of the number i in the numbers 1 to w, reading file blocks with preset sizes in the files of the number i in parallel each time to obtain a plurality of file blocks with the same size of the number i.
In the embodiment of the application, during sorting, the differently numbered files among the w files are sorted in parallel. The sorting of the file with number i is described here taking the file with number 1 as an example; files with other numbers are sorted in the same manner. The file with number 1 is 10MB in size and a 2MB file block is read each time, so the file is divided equally into 5 blocks.
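Reading fixed-size blocks through a memory mapping, as in A22, might look like the following sketch. It uses Python's standard `mmap` module; the function name `read_blocks` is hypothetical. The sketch reads the blocks one after another rather than in parallel, and it does not handle a line that straddles a block boundary, which the text does not specify. Note that `mmap` cannot map an empty file.

```python
import mmap

def read_blocks(path, block_size=2 * 1024 * 1024):
    """Yield consecutive byte blocks of at most block_size from a memory-mapped file."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for offset in range(0, len(mm), block_size):
            yield mm[offset:offset + block_size]
```

With the default block size, a 10MB file yields 5 blocks of 2MB each.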
A23, reading file blocks k in a plurality of file blocks with the same size of the number i, and analyzing data of each row in the file blocks k in parallel to obtain transaction time information of the data of each row in the file blocks k.
For example, file block k among the equal-sized file blocks of number 1 is read, and each line of data in file block k is parsed in parallel to obtain its transaction time information; that is, the first 2MB block is read and parsed line by line, and the transaction time of each line serves as the basis for sorting.
A24, when the (i+1)-th row of data in file block k is read, comparing the (i+1)-th row of data with the preceding i-th row of data, determining the target position of the (i+1)-th row of data in file block k, and inserting the (i+1)-th row of data at the target position to obtain the sorted file block k.
The transaction time of the row of data placed at the target position in the sorted file block k is not earlier than the transaction time of the row at the adjacent preceding position and is earlier than the transaction time of the row at the adjacent following position.
Here, for file block k, each line of data is read in turn and compared with the lines of file block k read before it to find the position where its time is greater than or equal to the preceding time and less than the following time; the line is inserted at that position and the following data is shifted back by one line. The sorted file block is then written back over the first 2MB file block of the file with number 1, which completes the sorting of file block k.
For example, referring to fig. 9, file block 1 contains 6 rows of data. After the first row is read, it is only compared with the next row; since 090002 is earlier than the next row's 092005, the position of the first row is unchanged. After the second row is read, 092005 is later than 090002, but it is also later than the next row's 090102, which means that the row corresponding to 092005 and the row corresponding to 090102 should be exchanged. After the exchange, the third row is read; 092005 is later than 090102 and earlier than 094002, so it stays in place. The subsequent rows are read and sorted in the same way until the time of every row in file block 1 is greater than or equal to the preceding time and less than the following time, at which point the sorting of file block 1 is complete; the sorted file block 1 is shown in fig. 10.
In the embodiment of the present application, after each file block k among the equal-sized file blocks of number i has been sorted, a plurality of sorted file blocks is obtained; for example, 5 sorted file blocks are obtained for the file with number 1.
File block 2, file block 3, file block 4 and file block 5 of the file with number 1 are sorted by the same method as file block 1; each sorted file block k is written back to the file with number i and the internal sorting result is persisted to disk, which completes the sorting within each file block.
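The per-block insertion sorting of A23 and A24 can be sketched as below. It is a minimal illustration, assuming the transaction time is the fifth "|"-separated field in fixed-width HHMMSS form, so that string comparison matches time order; the function name `sort_block` is hypothetical. `bisect.bisect_right` finds the position whose preceding time is less than or equal to the new row's time and whose following time is greater.

```python
import bisect

def sort_block(rows):
    """Insertion-sort the rows of one file block by transaction time."""
    sorted_rows, times = [], []
    for row in rows:
        t = row.split("|")[4]  # assumed position of the transaction time
        pos = bisect.bisect_right(times, t)  # preceding time <= t < following time
        times.insert(pos, t)
        sorted_rows.insert(pos, row)  # later rows shift back by one line
    return sorted_rows
```

Rows with equal times keep their original relative order, because each new row is inserted after existing rows with the same time.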
A25, sorting all the file blocks with the number i after sorting based on a sorting mode of multi-row matching, and obtaining all the file blocks with the number i after sorting.
In the embodiment of the application, the file blocks obtained in A24 are each sorted internally; multi-row matching is then used to sort across the file blocks of number i.
Here, to sort file block 1 and file block 2 relative to each other, both blocks are read, and a position is found in file block 1 at which m consecutive lines of file block 2 (m may be 1) are greater than or equal to the preceding time and less than the following time. The m lines are inserted at that position and the following lines of file block 1 are shifted back, so that the last m lines of file block 1 move down into file block 2. The m lines moved down into file block 2 are in turn compared and placed at their sorted positions in file block 2, so that the two file blocks are sorted.
For example, referring to fig. 11, file block 1 and file block 2 are read, and the position in file block 1 at which 2 consecutive lines of file block 2 are greater than or equal to the preceding time and less than the following time is found, namely the position of lines 4-5 of file block 1. The 2 consecutive lines of file block 2 are inserted at this position and the following lines of file block 1 are shifted back, so that lines 7-8 of file block 1 after the shift move down into file block 2. These 2 lines are in turn compared to find the positions in file block 2 at which they are greater than or equal to the preceding time and less than the following time; for example, line 7 of file block 1 after the shift should be inserted at line 4 of file block 2, and line 8 at line 6 of file block 2. The lines are placed at these sorted positions, so that the two file blocks are sorted.
In this way, file block 1 and file block 2 are ordered in memory, and the ordered results are rewritten back to the first and second 2MB file blocks numbered 1.
Similarly, referring to fig. 12, file block 3 is compared with file block 1 and file block 2 in turn to find a suitable position; m rows of file block 3 are moved to the suitable position in file block 1 or file block 2, and the m rows that overflow after the insertion are moved down to file block 2 or file block 3. If the overflow rows are moved to file block 2, the m rows that overflow from file block 2 after sorting are in turn moved down to file block 3, until the three file blocks are sorted.
File block 4 and file block 5 are processed in the same way: file block 4 is compared with file blocks 1, 2 and 3 in turn to select suitable positions for insertion and sorting, and file block 5 is compared with file blocks 1, 2, 3 and 4 in turn. This completes the sorting across the file blocks of the file with number 1.
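The sorting between two already sorted file blocks can be sketched for the simplest case, m = 1, in which one row at a time moves from file block 2 into file block 1 and one overflow row moves back. This is an illustrative reading of the procedure, not the implementation of the application; the function name `merge_two_blocks` and the `key` parameter (a function returning the transaction time of a row) are hypothetical.

```python
import bisect

def merge_two_blocks(block1, block2, key):
    """Sort two individually sorted blocks relative to each other, keeping their sizes."""
    b1, b2 = list(block1), list(block2)
    # While the earliest row of block 2 belongs before the last row of block 1 ...
    while b1 and b2 and key(b2[0]) < key(b1[-1]):
        row = b2.pop(0)
        pos = bisect.bisect_right([key(r) for r in b1], key(row))
        b1.insert(pos, row)  # following rows of block 1 shift back
        overflow = b1.pop()  # the last row of block 1 moves down to block 2
        pos2 = bisect.bisect_right([key(r) for r in b2], key(overflow))
        b2.insert(pos2, overflow)  # placed at its sorted position in block 2
    return b1, b2
```

After the loop, every row of the first block is no later than every row of the second block, and both blocks keep their original number of rows. Extending the same idea to blocks 3, 4 and 5 repeats the call pairwise against the earlier blocks.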
And A26, sorting the files from the number 1 to the number w based on a sorting mode of multi-row matching, and obtaining n sorted files.
In the embodiment of the application, the files numbered 2 to w are each sorted in the same way as the file with number 1; the differently numbered files can be sorted in parallel using techniques such as multithreading or distributed clusters, so that each of the w files is sorted independently.
Further, the method used to sort the file blocks of the file with number 1 is extended to sort across files. As shown in fig. 13, p rows of data from file block 1 of file number 2 are compared with file block 1 of file number 1; if a suitable position exists, the p rows are inserted at that position, all following file blocks of file number 1 are shifted back by p rows, and the p rows that overflow move into file number 2, where their sorted positions are found in turn. This is repeated until all data rows of all file blocks of file number 1 and file number 2 are sorted.
Further, file number 3 through file number w are processed in the same way, each being compared with the files before it, such as file number 1 and file number 2, until the sorting of the purchase 09-hour files is complete.
In some embodiments, step S205 performs data comparison on the n sorted files based on the difference comparison algorithm, which may be implemented by the following steps:
A31, exporting the database file i from the database according to the head line transaction time field and the tail line transaction time field of all the sorted file blocks of number i.
Wherein, all file blocks with the number i after sequencing have the same data partition identification with the database file i.
In the embodiment of the application, during reconciliation, data is exported from the database following the same rules as the data cleaning described above, and subfiles are exported according to the selected transaction type and transaction time range. To export the data, the time range of a sorted file within the partition is read first, for example the transaction time fields of the first and last lines of the file; the transaction data within this range is then exported as a single file.
In the embodiment of the application, when a file is exported from the database, the transaction type can be obtained from the file name and the time range can be bounded when the transaction records are exported; at the same time, the database script can sort the records directly.
Further, exported files are named by rule: the name of a database export file is the name of the corresponding reconciliation file prefixed with "db_". For example, if the sorted reconciliation file in the partition is named "0_09_000001", the database export file is named "db_0_09_000001".
In some embodiments, there are two cases when sorting the database export files. In the first case, a partition has only one database, and the database export file is sorted at export time according to the rules above. In the second case, a partition uses a plurality of databases, so that one reconciliation file corresponds to a plurality of database export files; taking three databases as an example and referring to fig. 14, the database export files corresponding to the reconciliation file are named "db1_0_09_000001", "db2_0_09_000001" and "db3_0_09_000001". In this case, the data lines of the three files are sorted and merged into one file using the file sorting algorithm described previously.
Illustratively, referring to fig. 15 and 16, database file 1 is exported from the database according to the head line transaction time field 090002 and the tail line transaction time field 095716 of all the file blocks of number 1, and, following the naming rule above, the exported database file 1 is named db_0_09_000001. Here, the transaction type may also be checked before the export, and the export is stopped if the transaction type does not match.
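The export of A31 can be sketched with Python's standard `sqlite3` module. The table name `transactions`, its column names, and the function names are all hypothetical, since the application does not describe the database schema; the sketch also filters only on transaction type and time, and sorts by time with the serial number as an assumed tie-breaker.

```python
import sqlite3

def db_export_name(recon_name: str) -> str:
    """Name of the database export file for a given reconciliation file name."""
    return "db_" + recon_name

def export_db_rows(conn, tx_type, first_time, last_time):
    """Export rows of one transaction type within a time range, sorted by time."""
    cur = conn.execute(
        "SELECT serial_no, account, tx_type, tx_date, tx_time, amount, share, remark "
        "FROM transactions "  # hypothetical table and columns
        "WHERE tx_type = ? AND tx_time BETWEEN ? AND ? "
        "ORDER BY tx_time, serial_no",
        (tx_type, first_time, last_time),
    )
    return ["|".join(str(v) for v in row) for row in cur]
```

The transaction type passed in would be taken from the reconciliation file name, and the two times from the first and last lines of the sorted reconciliation file.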
A32, calculating first hash values of all file blocks after sorting of the number i based on a difference comparison algorithm.
In an embodiment of the present application, the difference comparison algorithm includes, but is not limited to, a message digest algorithm md5.
A33, calculating a second hash value of the database file i based on the difference comparison algorithm.
In the embodiment of the present application, a difference comparison algorithm based on a message digest algorithm is used. For each partition shown in fig. 17, the mutually corresponding files in the partition are compared and the differences are screened out. Referring to fig. 18, a value H is calculated for each file using the message digest algorithm md5; if the H values calculated for the reconciliation file 0_09_000001 and the database export file db_0_09_000001 are the same, the two files are consistent, so the pair needs no further processing and is excluded directly.
If the H values calculated for the reconciliation file 0_09_000001 and the database export file db_0_09_000001 are not the same, processing continues with the next step.
A34, if the first hash value is different from the second hash value, determining that all the sorted file blocks of number i differ from the database file i.
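The file-level comparison of A32 to A34 can be sketched with the standard `hashlib` module. The function names `file_md5` and `files_match` and the chunk size are hypothetical; the sketch assumes both files use the same encoding and line endings, since otherwise identical data would produce different digests.

```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    """Compute the md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def files_match(recon_path, db_path):
    """True if the reconciliation file and the database export file are identical."""
    return file_md5(recon_path) == file_md5(db_path)
```

A pair for which `files_match` returns true is excluded; any other pair goes on to the block-level matching.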
A35, removing file blocks which are the same as all file blocks in the sequence of the number i from the database file i based on a data matching algorithm, and screening out a first difference file block in the database file i and a second difference file block in all file blocks in the sequence of the number i.
Here, after the md5 values of the files have been compared, the completely identical files are eliminated, and what remains are the files that differ between the reconciliation file and the database export file. Within a pair of differing files, most blocks of consecutive lines are likely to be sorted and equal, so file block data matching can be used to remove the identical file blocks and leave only the differing portion.
A36, determining difference information between the first difference file block in the database file i and the second difference file block in all the file blocks with the number i after sequencing, and carrying out data comparison based on the difference information.
In some embodiments, based on the data matching algorithm, a35 removes the same file blocks from the database file i as all the file blocks with the number i after sorting, and screens out the first difference file block from the database file i and the second difference file block from all the file blocks with the number i after sorting, which can be achieved by the following steps:
A351, if the number of data lines contained in all the sorted file blocks of number i differs from the number of data lines contained in the database file i, removing head line data and tail line data from all the sorted file blocks of number i and from the database file i at least once, to obtain all the line-removed file blocks of number i and the line-removed file block of the database file i.
Wherein, all file blocks after the line removal of the number i and file blocks after the line removal of the database file i have the same transaction period.
A352, calculating third hash values of all file blocks after the line removal of the number i based on a difference comparison algorithm.
A353, calculating a fourth hash value of the de-lined file block of the database file i based on the difference comparison algorithm.
A354, if the third hash value is different from the fourth hash value, determining that all file blocks after the line removal of the number i and the file blocks after the line removal of the database file i are different.
And A355, removing the same file blocks as all the file blocks after the line removal of the number i from the file blocks after the line removal of the database file i based on a data matching algorithm, and screening out a third difference file block in the file blocks after the line removal of the database file i and a fourth difference file block in all the file blocks after the line removal of the number i.
A356, removing tail line data of the fourth difference file block with the number i and the third difference file block of the database file i at least once, and screening out the first difference file block in the database file i and the second difference file block with the number i; wherein the first difference file block in the database file i and the second difference file block with the number i have the same tail line transaction time.
Referring to fig. 19, 20 and 21, the reconciliation file 0_09_000002 and the database export file db_0_09_000002 are taken as an example.
1) The numbers of lines of the two files are compared first; there are generally three cases:
in the first case, the number of lines of the 0_09_000002 file is greater than that of db_0_09_000002.
In the second case, the number of lines of the 0_09_000002 file is equal to that of db_0_09_000002.
In the third case, the number of lines of the 0_09_000002 file is smaller than that of db_0_09_000002.
2) The first lines of the two files are then compared, and the line with the smaller first-line transaction time is removed.
3) The tail lines of the two files are then compared, and the line with the larger tail-line transaction time is removed.
4) The removal of first and tail lines is repeated until the first-line transaction times of the two files are equal, the tail-line transaction times are equal, and the numbers of data lines are equal; the message digest algorithm sha1 is then used to compute and compare the sha1 values of the two files.
5) If the two sha1 values are equal, the equal file blocks are removed from the two files, leaving only the lines excluded in the preceding steps; go to step 7).
6) If the two sha1 values are not equal, the tail lines of both files are removed at the same time and the tail-line transaction times continue to be compared until the tail-line transaction times of the two files are again equal and the numbers of lines are equal; then return to step 5).
7) Steps 1) to 6) are repeated until no identical file blocks remain.
Further, associative arrays may be used to store the sha1 values and the data rows of the two difference files, respectively.
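Steps 1) to 7) can be sketched as follows. This is one possible reading, not the implementation of the application: rows removed from the head or tail are collected as candidate differences, and when the head and tail times agree but the line counts or sha1 values do not, both tail lines are removed (the text specifies this only for unequal sha1 values). The function names and the position of the transaction time field are assumptions.

```python
import hashlib

def tx_time(row):
    return row.split("|")[4]  # assumed position of the transaction time

def block_sha1(rows):
    h = hashlib.sha1()
    for row in rows:
        h.update(row.encode("utf-8") + b"\n")
    return h.hexdigest()

def strip_matching_block(recon_rows, db_rows):
    """Return (recon_diff, db_diff): the rows left after removing the equal block."""
    recon, db = list(recon_rows), list(db_rows)
    recon_diff, db_diff = [], []
    while recon and db:
        if tx_time(recon[0]) < tx_time(db[0]):      # step 2: smaller first line
            recon_diff.append(recon.pop(0))
        elif tx_time(db[0]) < tx_time(recon[0]):
            db_diff.append(db.pop(0))
        elif tx_time(recon[-1]) > tx_time(db[-1]):  # step 3: larger tail line
            recon_diff.append(recon.pop())
        elif tx_time(db[-1]) > tx_time(recon[-1]):
            db_diff.append(db.pop())
        elif len(recon) == len(db) and block_sha1(recon) == block_sha1(db):
            return recon_diff, db_diff              # step 5: equal block removed
        else:                                       # step 6: drop both tail lines
            recon_diff.append(recon.pop())
            db_diff.append(db.pop())
    recon_diff.extend(recon)
    db_diff.extend(db)
    return recon_diff, db_diff
```

The returned lists are not in time order and may still contain rows that are identical on both sides; such rows are eliminated later when the hash-keyed associative arrays are compared.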
In some embodiments, the determining, by a36, the difference information between the first difference file block in the database file i and the second difference file block in all the file blocks after the sorting of the number i, and performing the data comparison based on the difference information may be implemented by the following steps:
A361, calculating a fifth hash value of the first difference file block in the database file i based on the difference comparison algorithm, and recording the fifth hash value as the key and the data row of the first difference file block in the database file i as the value in a first associative array of the database file i.
A362, calculating a sixth hash value of the second difference file block of number i based on the difference comparison algorithm, and recording the sixth hash value as the key and the data row of the second difference file block of number i as the value in a second associative array of number i.
A363, within each partition, comparing the keys of the first associative array of the database file i with the keys of the second associative array of number i, and removing the data rows having the same key from the two associative arrays, to obtain a third associative array of the database file i and a fourth associative array of number i.
In the third associative array of the database file i, the key is the transaction serial number of each row of data and the value is the data row; in the fourth associative array of number i, the key is likewise the transaction serial number of each row of data and the value is the data row.
Illustratively, as shown in FIG. 22, an associative array is obtained whose keys are the transaction serial numbers of the rows of data and whose values are the data rows.
In the embodiment of the application, the second associative array of number i is denoted associative array A, the first associative array of the database file i is denoted associative array B, the fourth associative array of number i is denoted associative array C, and the third associative array of the database file i is denoted associative array D.
A364, determining difference information between the third associative array of the database file i and the fourth associative array of number i, and performing data comparison based on the difference information.
In some embodiments, A364, determining the difference information between the third associative array of the database file i and the fourth associative array of number i and performing the data comparison based on the difference information, may be implemented by the following steps:
A3641, if the fourth associative array of number i contains a first key that does not exist in the third associative array of the database file i, determining that the difference information indicates that the reconciliation file contains data that the database lacks, determining the data row corresponding to the first key in the fourth associative array of number i, and adding that data row to the third associative array of the database file i.
A3642, if the third associative array of the database file i contains a second key that does not exist in the fourth associative array of number i, determining that the difference information indicates that the database contains data that the reconciliation file lacks, determining the data row corresponding to the second key in the third associative array of the database file i, and deleting that data row.
A3643, if the fourth associative array of number i contains a third key that also exists in the third associative array of the database file i, determining that the difference information indicates that the transaction data exists in both the reconciliation file and the database but is inconsistent, and replacing the data row corresponding to the third key in the third associative array of the database file i with the data row corresponding to the third key in the fourth associative array of number i.
In one embodiment, referring to fig. 23, the reconciliation of the present application is further described in terms of associative arrays A, B, C and D. After all files have been processed, only two difference associative arrays remain for each partition once the reconciliation file and the database file have been compared. The difference associative array generated from the reconciliation file is denoted associative array A, and that generated from the database file is denoted associative array B. At this point the amount of data is already small, but there is still a small chance that identical file rows exist in both.
Within each partition, the two associative arrays are each traversed. The keys of associative array A are compared with the keys of associative array B, and the data having the same key in both arrays is excluded. The data remaining in associative arrays A and B is remapped into new associative arrays C and D respectively, with the unique transaction serial number of each data row as the key and the data row as the value.
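The construction of the four arrays can be sketched with Python dictionaries. The sketch assumes that the hash is computed per data row and that the transaction serial number is the first "|"-separated field; the function names are hypothetical.

```python
import hashlib

def row_hash(row):
    return hashlib.sha1(row.encode("utf-8")).hexdigest()

def build_difference_arrays(recon_diff_rows, db_diff_rows):
    """Build A and B keyed by row hash, then C and D keyed by transaction serial number."""
    a = {row_hash(r): r for r in recon_diff_rows}  # array A (reconciliation file)
    b = {row_hash(r): r for r in db_diff_rows}     # array B (database file)
    common = a.keys() & b.keys()                   # identical rows on both sides
    c = {r.split("|")[0]: r for k, r in a.items() if k not in common}
    d = {r.split("|")[0]: r for k, r in b.items() if k not in common}
    return c, d
```

Rows whose hashes appear in both A and B are identical and are dropped; only genuinely differing rows reach C and D.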
The keys of associative array C are compared with the keys of associative array D; there are three cases:
First, the key exists in associative array C but not in associative array D, which indicates that the reconciliation file contains data that the database lacks; the key is used to look up the corresponding data row in associative array C, and that data row is added to associative array D, so that the new transaction serial number and its transaction data are added on the database side.
Secondly, the key does not exist in associative array C but exists in associative array D, which indicates that the database contains data that the reconciliation file lacks; the key is used to look up the corresponding data row in associative array D, and that data row is deleted, which removes the redundant or incorrect transaction information from the database.
Thirdly, the key exists in both associative array C and associative array D, which indicates that the data exists in both the reconciliation file and the database but the transaction data is inconsistent (consistent data has already been removed by the preceding steps); the data row corresponding to the key in associative array D is replaced with the data row corresponding to the key in associative array C, which ensures that the data in the database is consistent with the data in the reconciliation file.
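The three cases can be sketched as a single pass over two dictionaries keyed by transaction serial number. The function name `reconcile` and the returned list of actions are hypothetical; the sketch works on an in-memory copy of array D and does not write to a real database.

```python
def reconcile(c, d):
    """Apply the three cases to a copy of D; return the corrected D and the actions taken."""
    result = dict(d)
    actions = []
    for serial, row in c.items():
        if serial not in d:
            actions.append(("add", serial))      # case 1: only in the reconciliation file
        else:
            actions.append(("replace", serial))  # case 3: in both, but inconsistent
        result[serial] = row
    for serial in d:
        if serial not in c:
            actions.append(("delete", serial))   # case 2: only in the database
            del result[serial]
    return result, actions
```

After the pass, the corrected copy of D contains exactly the rows of C, which reflects the goal that the database side ends up consistent with the reconciliation file.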
As can be seen from the above, the file data difference comparison algorithm of the application combines fragment-based processing of large files, sorting, and a partitioned design, which facilitates parallel computation on a distributed system and makes file block data matching more efficient; at the same time, files are processed with high correctness, and the probability of file processing failure is greatly reduced.
The following continues with an exemplary structure in which the file data comparison device 154 provided in the embodiments of the present application is implemented as software modules. In some embodiments, as shown in fig. 2, the software modules of the file data comparison device 154 stored in the memory 150 may be a file data comparison device in the server 100, including:
The processing module 1541 is configured to perform equal-ratio splitting on the obtained reconciliation file to obtain N split files;
the processing module 1541 is configured to divide the N split files into M data partitions according to the user identifiers of the transactions associated with the reconciliation file; each data partition in the M data partitions corresponds to one user identifier, and each data partition comprises m subfiles;
the processing module 1541 is configured to perform data cleaning and classifying on m subfiles in each data partition according to the transaction type and the transaction time information, and perform equal-ratio splitting on all the cleaned and classified files to obtain n files to be ordered;
the processing module 1541 is configured to sort n files to be sorted in the M data partitions according to the transaction time information, so as to obtain n sorted files;
the reconciliation module 1542 is configured to perform data comparison on the n sorted files based on a difference comparison algorithm.
In some embodiments, the processing module 1541 is configured to read each of the m subfiles and traverse the transaction type and transaction time information of each row of data in each subfile; process all row data in each subfile according to the cleaning and classifying condition of having the j-th transaction type and a transaction time period of one hour, to obtain all the cleaned and classified files, wherein the transaction types include the j-th transaction type; and perform equal-ratio splitting on all the cleaned and classified files to obtain w split files, wherein the w split files are the w files to be sorted that have the j-th transaction type and a transaction time period of one hour, and the w files corresponding to all the transaction types together form the n files to be sorted.
In some embodiments, the processing module 1541 is configured to number each of the w files, to obtain a plurality of files numbered 1 through w; according to a file memory mapping mode, aiming at a file with a number i in numbers 1 to w, parallelly reading file blocks with preset sizes in the file with the number i each time to obtain a plurality of file blocks with the same size of the number i; reading file blocks k in a plurality of file blocks with the same size of the number i, and analyzing data of each row in the file blocks k in parallel to obtain transaction time information of the data of each row in the file blocks k; if the (i+1) th row data in the file block k is read, comparing the (i+1) th row data with the previous (i) th row data, determining a target position of the (i+1) th row data in the file block k, and inserting the (i+1) th row data into the target position to obtain the ordered file block k; the transaction time of the (i+1) th row of data positioned at the target position in the ordered file block k is after the transaction time of the (i) th row of data positioned at the previous adjacent position of the target position and before the transaction time of the (i+2) th row of data positioned at the next adjacent position of the target position; sorting all file blocks with the number i after sorting based on a sorting mode of multi-row matching to obtain all file blocks with the number i after sorting; and sorting the files from the number 1 to the number w based on a sorting mode of multi-row matching to obtain n sorted files.
In some embodiments, the reconciliation module 1542 is configured to: export a database file i from the database according to the head-line transaction time field and the tail-line transaction field of all sorted file blocks of number i, wherein all sorted file blocks of number i and the database file i have the same data partition identifier; calculate a first hash value of all sorted file blocks of number i based on the difference comparison algorithm; calculate a second hash value of the database file i based on the difference comparison algorithm; if the first hash value differs from the second hash value, determine that all sorted file blocks of number i differ from the database file i; remove from the database file i, based on a data matching algorithm, the file blocks that are the same as file blocks among all sorted file blocks of number i, and screen out a first difference file block in the database file i and a second difference file block among all sorted file blocks of number i; and determine difference information between the first difference file block in the database file i and the second difference file block among all sorted file blocks of number i, and perform data comparison based on the difference information.
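By way of non-limiting illustration, the hash-based comparison and the screening of difference file blocks described above might be sketched in Python as follows. The present application does not name a particular hash function or data matching algorithm; the use of SHA-256, the representation of a file block as a list of row strings, and the names `digest` and `differing_blocks` are assumptions made for this sketch.

```python
import hashlib


def digest(rows):
    """Hash a sequence of rows; equal row sequences give equal digests."""
    h = hashlib.sha256()
    for row in rows:
        h.update(row.encode("utf-8"))
        h.update(b"\n")  # row separator so that row boundaries affect the hash
    return h.hexdigest()


def differing_blocks(file_blocks, db_blocks):
    """Drop blocks present on both sides; return (db_diff, file_diff)."""
    file_digests = {digest(block) for block in file_blocks}
    db_digests = {digest(block) for block in db_blocks}
    db_diff = [b for b in db_blocks if digest(b) not in file_digests]
    file_diff = [b for b in file_blocks if digest(b) not in db_digests]
    return db_diff, file_diff
```

In this sketch, `differing_blocks` would be called only after the overall digests of the two sides have been found to differ.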
In some embodiments, the reconciliation module 1542 is configured to: if the number of rows of data in all sorted file blocks of number i differs from the number of rows of data in the database file i, remove, at least once, the first-line data and the last-line data from all sorted file blocks of number i and from the database file i, to obtain all line-removed file blocks of number i and the line-removed file block of the database file i, wherein all line-removed file blocks of number i and the line-removed file block of the database file i have the same transaction period; calculate a third hash value of all line-removed file blocks of number i based on the difference comparison algorithm; calculate a fourth hash value of the line-removed file block of the database file i based on the difference comparison algorithm; if the third hash value differs from the fourth hash value, determine that all line-removed file blocks of number i differ from the line-removed file block of the database file i; remove from the line-removed file block of the database file i, based on the data matching algorithm, the file blocks that are the same as file blocks among all line-removed file blocks of number i, and screen out a third difference file block in the line-removed file block of the database file i and a fourth difference file block among all line-removed file blocks of number i; and remove, at least once, the tail-line data of the fourth difference file block of number i and of the third difference file block of the database file i, and screen out the first difference file block in the database file i and the second difference file block of number i, wherein the first difference file block in the database file i and the second difference file block of number i have the same tail-line transaction time.
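By way of non-limiting illustration, the removal of head-line and tail-line data until both sides cover the same transaction period might be sketched in Python as follows. The row layout `serial|user_id|tx_type|tx_time|amount`, the names `tx_time` and `trim_to_common_period`, and the rule of always trimming the side that extends further are assumptions made for this sketch; both inputs are assumed to be already sorted by transaction time.

```python
def tx_time(row):
    # Assumed layout: serial|user_id|tx_type|tx_time|amount
    return row.split("|")[3]


def trim_to_common_period(file_rows, db_rows):
    """Trim head and tail rows until both sides start and end at the same time.

    If the two sides share no boundary time, one side ends up empty.
    """
    a, b = list(file_rows), list(db_rows)
    # Remove head lines from whichever side starts earlier.
    while a and b and tx_time(a[0]) != tx_time(b[0]):
        (a if tx_time(a[0]) < tx_time(b[0]) else b).pop(0)
    # Remove tail lines from whichever side ends later.
    while a and b and tx_time(a[-1]) != tx_time(b[-1]):
        (a if tx_time(a[-1]) > tx_time(b[-1]) else b).pop()
    return a, b
```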
In some embodiments, the reconciliation module 1542 is configured to: calculate a fifth hash value of the first difference file block in the database file i based on the difference comparison algorithm, and record a first association array of the database file i with the fifth hash value as the key and a data row of the first difference file block in the database file i as the value; calculate a sixth hash value of the second difference file block of number i based on the difference comparison algorithm, and record a second association array of number i with the sixth hash value as the key and a data row of the second difference file block of number i as the value; in each partition, compare the keys of the first association array of the database file i with the keys of the second association array of number i, and remove the data rows whose keys are the same in the two association arrays, to obtain a third association array of the database file i and a fourth association array of number i, wherein in each of the third association array and the fourth association array the key is the transaction serial number of a row of data and the value is that data row; and determine difference information between the third association array of the database file i and the fourth association array of number i, and perform data comparison based on the difference information.
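By way of non-limiting illustration, the construction of the association arrays described above might be sketched in Python using dictionaries. The sketch assumes that the hash is computed per data row with SHA-256, that rows follow the layout `serial|user_id|tx_type|tx_time|amount`, and that the names `row_hash`, `serial_no`, and `residual_arrays` are hypothetical; the present application does not specify these details.

```python
import hashlib


def row_hash(row):
    return hashlib.sha256(row.encode("utf-8")).hexdigest()


def serial_no(row):
    # Assumed layout: serial|user_id|tx_type|tx_time|amount
    return row.split("|")[0]


def residual_arrays(db_rows, file_rows):
    """Remove rows identical on both sides, then re-key the rest by serial number."""
    db_by_hash = {row_hash(r): r for r in db_rows}        # first association array
    file_by_hash = {row_hash(r): r for r in file_rows}    # second association array
    common = db_by_hash.keys() & file_by_hash.keys()      # identical rows on both sides
    db_left = {serial_no(r): r for h, r in db_by_hash.items() if h not in common}
    file_left = {serial_no(r): r for h, r in file_by_hash.items() if h not in common}
    return db_left, file_left                              # third and fourth arrays
```

Because rows with identical hashes are removed first, any transaction serial number that remains in both returned dictionaries necessarily refers to rows whose contents differ.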
In some embodiments, the reconciliation module 1542 is configured to: if the fourth association array of number i contains a first key that does not exist in the third association array of the database file i, determine that the difference information indicates that the reconciliation file has data that the database does not have, determine the data row corresponding to the first key in the fourth association array of number i, and add that data row to the third association array of the database file i; if a second key that exists in the third association array of the database file i does not exist in the fourth association array of number i, determine that the difference information indicates that the reconciliation file lacks data that the database has, determine the data row corresponding to the second key in the third association array of the database file i, and delete that data row from the third association array of the database file i; and if a third key that exists in the third association array of the database file i also exists in the fourth association array of number i, determine that the difference information indicates that the transaction data exists in both the reconciliation file and the database but is inconsistent, and replace the data row corresponding to the third key in the third association array of the database file i with the data row corresponding to the third key in the fourth association array of number i.
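By way of non-limiting illustration, the three difference cases described above might be sketched in Python as follows, with each association array represented as a dictionary keyed by transaction serial number. The function name `reconcile`, the report labels, and the choice to return an updated copy rather than modify the database-side array in place are assumptions made for this sketch. The sketch treats the reconciliation file as authoritative, so that in the inconsistent case the database-side row is replaced by the file-side row.

```python
def reconcile(db_left, file_left):
    """Classify each key and return (updated database-side array, report)."""
    result = dict(db_left)
    report = {"missing_in_db": [], "missing_in_file": [], "mismatch": []}
    for key, row in file_left.items():
        if key not in db_left:
            # Reconciliation file has data that the database does not have: add it.
            report["missing_in_db"].append(key)
        else:
            # Present on both sides but inconsistent: replace with the file's row.
            report["mismatch"].append(key)
        result[key] = row
    for key in db_left:
        if key not in file_left:
            # Database has data that the reconciliation file lacks: delete it.
            report["missing_in_file"].append(key)
            del result[key]
    return result, report
```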
According to the data comparison apparatus for a file provided by the embodiments of the present application, equal-ratio splitting is performed on the obtained reconciliation file to obtain N split files; the N split files are divided into M data partitions according to the user identifiers of the exchanges associated with the reconciliation file, wherein each of the M data partitions corresponds to one user identifier and comprises m subfiles; data cleaning and classification are performed on the m subfiles in each data partition according to the transaction type and the transaction time information, and equal-ratio splitting is performed on all cleaned and classified files to obtain n files to be sorted; the n files to be sorted in the M data partitions are sorted according to the transaction time information to obtain n sorted files; and data comparison is performed on the n sorted files based on a difference comparison algorithm. In other words, splitting the reconciliation file allows a large file to be parsed and processed in fragments, which speeds up processing; sorting the files within each partition further improves the accuracy of file processing and avoids the processing failures that are likely when unordered files are processed directly.
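By way of non-limiting illustration, the division of the N split files into data partitions by user identifier might be sketched in Python as follows. The row layout `serial|user_id|tx_type|tx_time|amount`, the names `user_id` and `partition_by_user`, and the choice to produce one subfile per split file and user are assumptions made for this sketch.

```python
from collections import defaultdict


def user_id(row):
    # Assumed layout: serial|user_id|tx_type|tx_time|amount
    return row.split("|")[1]


def partition_by_user(split_files):
    """Map each user identifier to its subfiles (one per split file containing it)."""
    partitions = defaultdict(list)
    for split_file in split_files:
        per_user = defaultdict(list)
        for row in split_file:
            per_user[user_id(row)].append(row)
        for uid, rows in per_user.items():
            partitions[uid].append(rows)
    return dict(partitions)
```

Each key of the returned dictionary corresponds to one data partition, and each list under that key corresponds to the subfiles of that partition, which can then be cleaned, sorted, and compared independently of the other partitions.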
It should be noted that the description of the apparatus embodiment of the present application is similar to the description of the method embodiment above and has similar beneficial effects, so a detailed description is omitted. For technical details not disclosed in the apparatus embodiment, refer to the description of the method embodiment of the present application.
Embodiments of the present application provide a storage medium having stored therein executable instructions which, when executed by a processor, cause the processor to perform the method provided by the embodiments of the present application.
In some embodiments, the storage medium may be a computer-readable storage medium, such as a ferroelectric random access memory (FRAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM); it may also be any of various devices that include one of the above memories or any combination thereof.
In some embodiments, the executable instructions may be in the form of programs, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, the executable instructions may, but need not, correspond to files in a file system, and may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a Hypertext Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code). As an example, the executable instructions may be deployed to be executed on one computing device, on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.
The foregoing are merely exemplary embodiments of the present application and are not intended to limit the scope of protection of the present application. Any modification, equivalent replacement, or improvement made within the spirit and scope of the present application falls within the scope of protection of the present application.

Claims (9)

1. A data comparison method for a file, comprising:
performing equal-ratio splitting on the obtained account checking file to obtain N split files;
dividing the N split files into M data partitions according to user identifications of exchanges associated with the reconciliation files; each data partition in the M data partitions corresponds to a user identifier, and each data partition comprises m subfiles;
performing, according to the transaction type and the transaction time information, data cleaning and classifying on the m subfiles in each data partition, and performing equal-ratio splitting on all files after cleaning and classifying to obtain n files to be sorted, comprising: reading each subfile in the m subfiles, and traversing the transaction type and the transaction time information of each row of data in each subfile; processing all row data in each subfile according to a cleaning and classifying condition of a j-th transaction type and a transaction time span of one hour to obtain all files after cleaning and classifying; wherein the transaction type includes the j-th transaction type; and performing equal-ratio splitting on all the files after cleaning and classifying to obtain w split files; the w split files comprise w files to be sorted that have the j-th transaction type and a transaction time span of one hour, and the w files corresponding to all the transaction types form the n files to be sorted;
sorting the n files to be sorted in the M data partitions according to the transaction time information to obtain n sorted files;
and based on a difference comparison algorithm, carrying out data comparison on the n ordered files.
2. The method of claim 1, wherein the sorting the n files to be sorted in the M data partitions according to the transaction time information, to obtain n sorted files, includes:
numbering each file in the w files to obtain a plurality of files from the number 1 to the number w;
according to a file memory mapping mode, for the files of the number i in the numbers 1 to w, parallelly reading file blocks with preset sizes in the files of the number i each time to obtain a plurality of file blocks with the same size of the number i;
reading a file block k in a plurality of file blocks with the same size of the number i, and analyzing each row of data in the file block k in parallel to obtain transaction time information of each row of data in the file block k;
if the (i+1)-th row of data in the file block k is read, comparing the (i+1)-th row of data with the preceding i rows of data, determining a target position of the (i+1)-th row of data in the file block k, and inserting the (i+1)-th row of data at the target position to obtain the sorted file block k; wherein the transaction time of the (i+1)-th row of data located at the target position in the sorted file block k is after the transaction time of the i-th row of data located at the position immediately preceding the target position and before the transaction time of the (i+2)-th row of data located at the position immediately following the target position;
sorting, based on a multi-row matching sorting mode, all of the individually sorted file blocks of the number i to obtain all sorted file blocks of the number i;
and ordering the files from the number 1 to the number w based on a multi-row matching ordering mode to obtain the n ordered files.
3. The method according to claim 2, wherein the data comparison of the n sorted files based on the difference comparison algorithm comprises:
according to the head line transaction time field and the tail line transaction field of all the file blocks after the sorting of the number i, a database file i is derived from a database; wherein, all file blocks after the sequencing of the number i have the same data partition identification with the database file i;
calculating first hash values of all file blocks after sequencing of the number i based on the difference comparison algorithm;
calculating a second hash value of the database file i based on the difference comparison algorithm;
if the first hash value is different from the second hash value, determining that all the file blocks with the number i after sequencing are different from the database file i;
removing, based on a data matching algorithm, file blocks in the database file i that are the same as file blocks among all sorted file blocks of the number i, and screening out a first difference file block in the database file i and a second difference file block among all sorted file blocks of the number i;
and determining difference information between the first difference file block in the database file i and the second difference file block in all the file blocks with the number i after sequencing, and carrying out data comparison based on the difference information.
4. A method according to claim 3, wherein the step of removing the same file blocks as the file blocks in the number i of the sorted file blocks from the database file i based on the data matching algorithm, and selecting the first difference file block in the database file i and the second difference file block in the number i of the sorted file blocks includes:
if the number of rows of data contained in all sorted file blocks of the number i is different from the number of rows of data contained in the database file i, removing, at least once, the first-line data and the last-line data from all sorted file blocks of the number i and from the database file i, to obtain all file blocks after the line removal of the number i and the file blocks after the line removal of the database file i; wherein all file blocks after the line removal of the number i and the file blocks after the line removal of the database file i have the same transaction period;
calculating a third hash value of all file blocks after the line removal of the number i based on the difference comparison algorithm;
calculating a fourth hash value of the file block of the database file i after the line removal based on the difference comparison algorithm;
if the third hash value is different from the fourth hash value, determining that all file blocks after the line removal of the number i are different from the file blocks after the line removal of the database file i;
removing file blocks which are the same as all file blocks after the line removal of the number i from the file blocks after the line removal of the database file i based on the data matching algorithm, and screening out a third difference file block in the file blocks after the line removal of the database file i and a fourth difference file block in all file blocks after the line removal of the number i;
removing, at least once, tail-line data of the fourth difference file block of the number i and of the third difference file block of the database file i, and screening out the first difference file block in the database file i and the second difference file block of the number i; wherein the first difference file block in the database file i and the second difference file block of the number i have the same tail-line transaction time.
5. The method according to claim 3 or 4, wherein determining difference information between a first difference file block in the database file i and a second difference file block in all file blocks of the number i after sorting, and performing data comparison based on the difference information, comprises:
calculating a fifth hash value of a first difference file block in the database file i based on the difference comparison algorithm, and recording the fifth hash value as a key and a data row of the first difference file block in the database file i as a value as a first association array of the database file i;
calculating a sixth hash value of the second difference file block of the number i based on the difference comparison algorithm, and recording the sixth hash value as a key and a data row of the second difference file block of the number i as a value as a second associated array of the number i;
in each partition, comparing the key of the first association array of the database file i with the key of the second association array of the number i, and removing the data rows with the same keys in the two association arrays to obtain a third association array of the database file i and a fourth association array of the number i; wherein, the key in the third association array of the database file i is the transaction serial number of each row of data and the value is a data row; the key in the fourth associated array of the number i is the transaction serial number of each row of data and the value is a data row;
and determining difference information between the third associated array of the database file i and the fourth associated array of the number i, and performing data comparison based on the difference information.
6. The method of claim 5, wherein determining difference information between the third associated array of the database file i and the fourth associated array of the number i and performing data comparison based on the difference information comprises:
if a first key that does not exist in the third association array of the database file i exists in the fourth association array of the number i, determining that the difference information characterizes that the reconciliation file has data that the database does not have, determining the data row corresponding to the first key in the fourth association array of the number i, and adding that data row to the third association array of the database file i;
if a second key existing in the third association array of the database file i does not exist in the fourth association array of the number i, determining that the difference information characterizes that the reconciliation file lacks data that the database has, determining the data row corresponding to the second key in the third association array of the database file i, and deleting that data row from the third association array of the database file i;
if a third key existing in the third association array of the database file i also exists in the fourth association array of the number i, determining that the difference information characterizes that transaction data exists in both the reconciliation file and the database and that the transaction data is inconsistent, and replacing the data row corresponding to the third key in the third association array of the database file i with the data row corresponding to the third key in the fourth association array of the number i.
7. A data comparison apparatus for a file, comprising:
the processing module is used for carrying out equal-ratio splitting on the obtained account checking files to obtain N split files;
the processing module is used for dividing the N split files into M data partitions according to the user identifications of the exchanges associated with the reconciliation files; each data partition in the M data partitions corresponds to a user identifier, and each data partition comprises m subfiles;
the processing module is used for reading each subfile in the m subfiles and traversing the transaction type and the transaction time information of each row of data in each subfile; processing all row data in each subfile according to a cleaning and classifying condition of a j-th transaction type and a transaction time span of one hour to obtain all files after cleaning and classifying; wherein the transaction type includes the j-th transaction type; and performing equal-ratio splitting on all the files after cleaning and classifying to obtain w split files; the w split files comprise w files to be sorted that have the j-th transaction type and a transaction time span of one hour, and the w files corresponding to all the transaction types form n files to be sorted;
the processing module is used for sorting the n files to be sorted in the M data partitions according to the transaction time information to obtain n sorted files;
and the account checking module is used for comparing the data of the n ordered files based on a difference comparison algorithm.
8. A data comparison apparatus of a file, comprising:
a memory for storing executable instructions; a processor for implementing the method of any one of claims 1 to 6 when executing executable instructions stored in said memory.
9. A computer readable storage medium storing executable instructions for causing a processor to perform the method of any one of claims 1 to 6.
CN202110724780.4A 2021-06-29 2021-06-29 File data comparison method, device, equipment and storage medium Active CN113342750B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110724780.4A CN113342750B (en) 2021-06-29 2021-06-29 File data comparison method, device, equipment and storage medium
PCT/CN2021/140732 WO2023273235A1 (en) 2021-06-29 2021-12-23 Data comparison method, apparatus and device for file, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110724780.4A CN113342750B (en) 2021-06-29 2021-06-29 File data comparison method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113342750A CN113342750A (en) 2021-09-03
CN113342750B true CN113342750B (en) 2023-11-17

Family

ID=77481343

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110724780.4A Active CN113342750B (en) 2021-06-29 2021-06-29 File data comparison method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN113342750B (en)
WO (1) WO2023273235A1 (en)

Also Published As

Publication number Publication date
WO2023273235A1 (en) 2023-01-05
CN113342750A (en) 2021-09-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant