CN113283973A - Account checking difference data processing method and device, computer equipment and storage medium - Google Patents

Publication number
CN113283973A
Authority
CN
China
Prior art keywords
data
difference data
difference
value
fingerprint value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110512158.7A
Other languages
Chinese (zh)
Inventor
任道亮
洪瑞哲
谌鸿雪
鲍贤武
王禄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Suning Software Technology Co ltd
Original Assignee
Nanjing Suning Software Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Suning Software Technology Co ltd
Priority to CN202110512158.7A
Publication of CN113283973A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/12 Accounting
    • G06Q40/125 Finance or payroll
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2255 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Development Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Technology Law (AREA)
  • Software Systems (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a method and a device for processing account checking (reconciliation) difference data, computer equipment and a storage medium. The method comprises the following steps: identifying a first data table and a second data table obtained by account checking, acquiring first difference data from the first data table, and acquiring second difference data from the second data table; calculating the similarity of the first difference data and the second difference data; when the similarity meets a preset condition, determining the difference data which fails to be matched in the first difference data and the second difference data; determining the field corresponding to the difference data which fails to be matched according to the first data table and the second data table; and obtaining the data difference types of the first difference data and the second difference data according to the corresponding fields. By analyzing the difference data automatically, the method obtains the reason for the difference without relying on manual experience, which improves the accuracy of the difference reasons identified during account checking.

Description

Account checking difference data processing method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for processing account checking difference data, a computer device, and a storage medium.
Background
In the financial accounting activities of enterprises, account checking (reconciliation) is an indispensable step. It allows abnormal data and audit-related risks to be found in time during financial accounting. For the problem of finding, at large data volumes, the difference data that the reconciling parties cannot match, the industry has various implementation schemes, each with its own characteristics. However, finding the difference data is not enough: the reasons for the difference data also need to be analyzed, so that problems in the business process or the information system can be located. For analyzing the causes of difference data, current practice uses the following methods or combinations thereof:
1. Purely manual analysis
In this method, business and financial staff browse the difference data of all parties, look for clues, and, taking into account the data sources and their business processes, comprehensively analyze and investigate the reasons for the differences. This method is time-consuming and labor-intensive and is only practical for small data volumes; staff faced with the difference data are often at a loss, and the actual effect depends heavily on how experienced the staff are.
2. Preset classification scenario
In this method, several specific scenarios are summarized in advance from the experience that financial staff have accumulated in long-term business practice and are treated as the possible causes of differences; the difference data are then searched for records that fit the preset scenario characteristics. In other words, the method "assumes" several causes of difference in advance, retrieves the difference data that match them, and classifies the data accordingly. This method also relies on the experience of the relevant personnel, does not actually analyze the data itself, and cannot cope with new causes of difference that may arise.
Disclosure of Invention
In view of the above, it is necessary to provide a method, an apparatus, a computer device and a storage medium for processing account checking difference data that can obtain the reason for the difference data through automatic analysis, without relying on manual experience, and thereby improve the accuracy of the difference reasons identified in account checking.
A processing method of account checking difference data comprises the following steps: identifying a first data table and a second data table obtained by account checking, acquiring first difference data from the first data table, and acquiring second difference data from the second data table; calculating the similarity of the first difference data and the second difference data; when the similarity meets a preset condition, determining the difference data which fails to be matched in the first difference data and the second difference data; determining a field corresponding to the difference data which fails to be matched according to the first data table and the second data table; and obtaining the data difference types of the first difference data and the second difference data according to the corresponding fields.
In one embodiment, calculating the similarity between the first difference data and the second difference data includes: acquiring a first fingerprint value and a second fingerprint value, wherein the first fingerprint value is obtained by processing first difference data by adopting a SimHash algorithm, and the second fingerprint value is obtained by processing second difference data by adopting the SimHash algorithm; calculating a hamming distance between the first fingerprint value and the second fingerprint value; and obtaining the similarity according to the Hamming distance.
In one embodiment, obtaining the first fingerprint value and the second fingerprint value comprises: performing hash calculation on each difference data in the first difference data to obtain a third fingerprint value; changing each value 0 in the third fingerprint value to the value -1, and multiplying each modified value in the third fingerprint value by the corresponding weight to obtain a first weight fingerprint value; accumulating the values of the first weight fingerprint values position by position to obtain a first accumulated value; carrying out binarization processing on the first accumulated value to obtain the first fingerprint value; performing hash calculation on each difference data in the second difference data to obtain a fourth fingerprint value; changing each value 0 in the fourth fingerprint value to the value -1, and multiplying each modified value in the fourth fingerprint value by the corresponding weight to obtain a second weight fingerprint value; accumulating the values of the second weight fingerprint values position by position to obtain a second accumulated value; and carrying out binarization processing on the second accumulated value to obtain the second fingerprint value.
In one embodiment, calculating the hamming distance between the first fingerprint value and the second fingerprint value comprises: detecting whether the data at the same position in the first fingerprint value and the second fingerprint value are the same; counting the number of different data; the hamming distance is determined by the number.
In one embodiment, the method for processing account checking difference data further includes: generating a first fingerprint value of the first difference data, and counting a first number of specific bit values in the first fingerprint value; carrying out fingerprint segmentation on the first fingerprint value to obtain a plurality of segments of first sub-fingerprint values; generating a second fingerprint value of the second difference data, and counting a second number of specific bit values in the second fingerprint value; and carrying out fingerprint segmentation on the second fingerprint value to obtain a plurality of segments of second sub-fingerprint values. Calculating the Hamming distance between the first fingerprint value and the second fingerprint value then comprises: when it is detected, based on the first sub-fingerprint value and the first number and on the second sub-fingerprint value and the second number, that the first difference data matches the second difference data, calculating the Hamming distance between the first fingerprint value and the second fingerprint value.
In one embodiment, a method for processing account checking difference data further includes: combining the first sub-fingerprint value and the first number to obtain first combined data; combining the second sub-fingerprint value and the second number to obtain second combined data; a threshold value for the hamming distance is determined. Detecting a match of the first difference data and the second difference data based on the first sub-fingerprint value and the first number and the second sub-fingerprint value and the second number, comprising: determining third combined data according to the first combined data and a threshold value of the Hamming distance; when the second combined data and the third combined data are the same, it is detected that the first difference data matches the second difference data.
In one embodiment, after obtaining the data difference types of the first difference data and the second difference data according to the corresponding fields, the method further includes: acquiring a plurality of data difference types of a first data table and a second data table; classifying the data difference types to obtain a plurality of difference types of account checking; calculating the ratio of each difference category according to the difference data corresponding to each difference category; and graphically displaying each difference category and the proportion of each difference category.
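The ratio calculation in this embodiment can be illustrated with a minimal Python sketch. The function name `category_ratios` and the category labels are hypothetical and are not taken from the application; the sketch assumes that each analyzed pair of difference data has already been assigned one difference category, and it leaves the graphical display to whatever charting tool is used.

```python
from collections import Counter

def category_ratios(categories):
    """Return the share of each difference category (hypothetical helper)."""
    counts = Counter(categories)          # number of difference records per category
    total = sum(counts.values())
    if total == 0:
        return {}                         # nothing to report
    return {category: n / total for category, n in counts.items()}

# Illustrative labels only: one category per analyzed pair of difference data.
ratios = category_ratios(["amount", "amount", "date", "order number"])
```

The resulting mapping from category to ratio could then be passed to a pie or bar chart for display.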
An apparatus for processing reconciliation difference data, comprising: the obtaining module is used for identifying a first data table and a second data table obtained by account checking, obtaining first difference data from the first data table and obtaining second difference data from the second data table; the calculating module is used for calculating the similarity of the first difference data and the second difference data; the first determining module is used for determining the difference data which fails to be matched in the first difference data and the second difference data when the similarity meets the preset condition; the second determining module is used for determining a field corresponding to the difference data which fails to be matched according to the first data table and the second data table; and the obtaining module is used for obtaining the data difference types of the first difference data and the second difference data according to the corresponding fields.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of any of the above embodiments when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any of the above embodiments.
According to the above method, device, computer equipment and storage medium for processing account checking difference data, the first data table and the second data table obtained by account checking are identified, the first difference data are obtained from the first data table, the second difference data are obtained from the second data table, and the similarity between the first difference data and the second difference data is calculated. The similarity is thus computed automatically and does not need to be judged through manual experience. Further, when the similarity meets a preset condition, the difference data which fails to be matched in the first difference data and the second difference data is determined, and the field corresponding to the difference data which fails to be matched is determined according to the first data table and the second data table. Because only the difference data whose similarity meets the preset condition are used for judging the data difference type, the preset condition selects the difference data that are relevant to the actual business analysis and reduces the amount of calculation. Finally, the data difference types of the first difference data and the second difference data are obtained according to the corresponding fields. The difference data are thereby analyzed automatically, the data difference type is obtained without relying on manual experience, the reason for the difference can be determined from the data difference type, and the accuracy of the difference reasons identified in account checking is improved.
Drawings
FIG. 1 is a diagram of an application environment of a method for processing reconciliation difference data in one embodiment;
FIG. 2 is a flow diagram illustrating a method for processing reconciliation difference data in accordance with one embodiment;
FIG. 3 is a diagram illustrating the results of an ordinary Hash algorithm and the SimHash algorithm when processing difference data in one embodiment;
FIG. 4 is a diagram illustrating a computation process of the SimHash algorithm in one embodiment;
FIG. 5 is a diagram illustrating how Hamming distance is calculated in one embodiment;
FIG. 6 is a diagram illustrating the principle of fingerprint segmentation of fingerprint values in one embodiment;
FIG. 7 is a diagram illustrating the mapping from fingerprint segments and the number of 1 bits to complete fingerprints in one embodiment;
FIG. 8 is a flow diagram that illustrates the processing of similarity of difference data in one embodiment;
FIG. 9 is a table diagram of difference data in one embodiment;
FIG. 10 is a diagram representation of a similarity analysis report for difference data in one embodiment;
FIG. 11 is a block diagram of an apparatus for processing reconciliation difference data in one embodiment;
FIG. 12 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The application provides a processing method of account checking difference data, which is applied to an application environment shown in fig. 1. As shown in fig. 1, the server cluster 102 is used to implement a method for processing account checking difference data of the present application. The terminal 104 is configured to configure the server cluster 102, for example, configure preset conditions of the reconciliation difference data. The database 106 stores tables of difference data obtained by reconciliation, such as a first data table 1062 and a second data table 1064. Specifically, the server cluster 102 identifies the first data table 1062 and the second data table 1064 obtained by reconciliation, reads the first difference data in the first data table 1062 from the database 106, and reads the second difference data in the second data table 1064. Further, the similarity of the first difference data and the second difference data is calculated. In addition, preset conditions are configured in the terminal 104, and the preset conditions are input into the server cluster 102. When the server cluster 102 detects that the similarity meets the preset condition, the difference data which fails to be matched in the first difference data and the second difference data is determined, the field corresponding to the difference data which fails to be matched is determined according to the first data table 1062 and the second data table 1064, the data difference type of the first difference data and the data difference type of the second difference data are obtained according to the corresponding field, and the data difference type is displayed in the terminal 104, so that business personnel can visually determine the data difference type of the account difference data, manual data difference judgment is not needed, and the accuracy of the difference reason of the difference data in account checking is improved.
In an embodiment, the method for processing account checking difference data provided by the present application is applied to the server cluster 102 shown in fig. 1. As shown in fig. 2, a processing method of reconciliation difference data includes the following steps:
S202, identifying a first data table and a second data table obtained by account checking, acquiring first difference data from the first data table, and acquiring second difference data from the second data table.
In this embodiment, the data that the two reconciliation parties cannot match during reconciliation is referred to as difference data. Assume that the two reconciliation parties are party A and party B, and that each piece of data of party A and of party B has a plurality of fields. Reconciliation means using the field values of certain fields as the matching basis and finding the data of the two parties that satisfy it; the remaining data, for which no match can be found, is the difference data. For example, with equality of field values as the matching basis, party A holds order data, party B holds bill data, and the matching bases for reconciliation are "order number of party A = business serial number of party B" and "amount due of party A = amount actually paid of party B".
The difference data are stored through the first data table and the second data table respectively. The first data table and the second data table respectively store difference data of each account checking party. For example, a first data table stores difference data in party a that cannot be matched with party B, and a second data table stores difference data in party B that cannot be matched with party a.
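How the two difference tables come about can be sketched as follows. This is a minimal illustration under stated assumptions, not the reconciliation procedure of the application: the function `reconcile` and the field names `order_no`, `amount_due`, `serial_no` and `amount_paid` are hypothetical, and the matching basis is the equality rule from the example above.

```python
from collections import Counter

def reconcile(orders, bills):
    """Return (first_table, second_table): records of each party with no match."""
    # Assumed matching basis: order_no == serial_no and amount_due == amount_paid.
    bill_keys = Counter((b["serial_no"], b["amount_paid"]) for b in bills)
    first_table = []
    for order in orders:
        key = (order["order_no"], order["amount_due"])
        if bill_keys[key] > 0:
            bill_keys[key] -= 1            # consume one matching bill
        else:
            first_table.append(order)      # party A record that party B cannot match
    second_table = []
    for bill in bills:
        key = (bill["serial_no"], bill["amount_paid"])
        if bill_keys[key] > 0:             # this bill was never consumed by an order
            bill_keys[key] -= 1
            second_table.append(bill)      # party B record that party A cannot match
    return first_table, second_table

orders = [{"order_no": "A1", "amount_due": 100}, {"order_no": "A2", "amount_due": 50}]
bills = [{"serial_no": "A1", "amount_paid": 100}, {"serial_no": "A2", "amount_paid": 55}]
first_table, second_table = reconcile(orders, bills)
```

Here the order and bill numbered A2 end up in the first and second data tables respectively, because their amounts differ; such a pair is the kind of difference data the subsequent similarity analysis is meant to link.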
The server identifies the first data table and the second data table obtained by reconciliation, and acquires the first difference data from the first data table and the second difference data from the second data table. The first difference data may be any piece of data in the first data table; that piece of data has one or more fields, and the field value of each field is one item of difference data in that piece of data. Likewise, the second difference data may be any piece of data in the second data table; that piece of data has one or more fields, and the field value of each field is one item of difference data in that piece of data.
S204, calculating the similarity of the first difference data and the second difference data.
The difference analysis of reconciliation data can be regarded as calculating the similarity between the data of the two reconciliation parties. If the data of the two parties are identical, that is, the similarity takes its maximum value, the data of the two parties match successfully. In this embodiment, the server calculates the similarity between the first difference data and the second difference data in order to determine whether the difference data of the two parties are data that should have matched successfully but became difference data because of some error. The similarity may be calculated by selecting a similarity algorithm suited to the characteristics of the specific scenario, such as the SimHash algorithm combined with the Hamming distance, and applying it to the first difference data and the second difference data.
S206, when the similarity meets the preset condition, determining the difference data which fails to be matched in the first difference data and the second difference data.
In this embodiment, a manager configures the preset condition in advance through a terminal and stores it in the server. After the server calculates the similarity of the first difference data and the second difference data, it detects whether the similarity meets the preset condition. If so, the difference data which fails to be matched in the first difference data and the second difference data is further determined. If not, that subsequent operation is not executed. The preset condition may be a similarity threshold, in which case the similarity meets the preset condition when it is greater than the similarity threshold. Screening the first difference data and the second difference data with the preset condition selects the data that should have matched successfully in reconciliation but became difference data because of some error, without manual operation, which improves the efficiency of screening the difference data.
When the similarity meets the preset condition, the first difference data is matched against the second difference data, and the difference data which fails to be matched is identified. A matching failure may mean that the data at corresponding positions in the first difference data and the second difference data are not equal, or that they do not satisfy a set matching condition.
S208, determining the field corresponding to the difference data which fails to be matched according to the first data table and the second data table.
In this embodiment, each item of the first difference data in the first data table corresponds to a field, and each item of the second difference data in the second data table corresponds to a field. The server determines the field corresponding to the difference data which fails to be matched according to the first data table and the second data table. In general, when the similarity between a piece of data in the first data table and a piece of data in the second data table meets the preset condition, the fields of the two pieces of data correspond to each other. Therefore, once the difference data which fails to be matched has been determined, its field can be determined from either the first data table or the second data table. The corresponding fields may be the fields used by the reconciliation rule, such as order number, account number, amount, and date.
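Steps S206 and S208 can be sketched together as follows. The sketch assumes that the preset condition is a similarity threshold, as suggested above, and that the correspondence between the fields of the two tables is supplied as a list of field-name pairs; the function names and field names are hypothetical.

```python
def mismatched_fields(first, second, field_pairs):
    """List the (field_a, field_b) pairs whose values differ between two records."""
    return [
        (field_a, field_b)
        for field_a, field_b in field_pairs
        if first.get(field_a) != second.get(field_b)
    ]

def analyse_pair(first, second, similarity, threshold, field_pairs):
    """Apply the preset condition, then locate the fields that fail to match."""
    if similarity <= threshold:
        return None  # preset condition not met: the pair is not analysed further
    return mismatched_fields(first, second, field_pairs)

# Hypothetical records and field correspondence, for illustration only.
order = {"order_no": "A2", "amount_due": 50, "date": "2021-05-01"}
bill = {"serial_no": "A2", "amount_paid": 55, "date": "2021-05-01"}
pairs = [("order_no", "serial_no"), ("amount_due", "amount_paid"), ("date", "date")]
failed = analyse_pair(order, bill, similarity=0.9, threshold=0.8, field_pairs=pairs)
```

In this illustration only the amount fields fail to match, so the amount field is the one carried forward to the next step.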
S210, obtaining the data difference type of the first difference data and the second difference data according to the corresponding field.
In this embodiment, the server obtains the data difference types of the first difference data and the second difference data from the field content of the corresponding field. The data difference type represents the reason why the data became difference data during reconciliation, for example that the amounts of the two reconciliation parties differ, that the times of the two parties differ, or that the order numbers of the two parties differ. Specifically, the server identifies the field content of the corresponding field and determines the data difference type of the first difference data and the second difference data based on it. If the field content of the corresponding field is a price, the data difference type is determined to be that the amounts of the two reconciliation parties differ. If the field content of the corresponding field is a date, the data difference type is determined to be that the times of the two reconciliation parties differ.
In the above method for processing account checking difference data, the first data table and the second data table obtained by account checking are identified, the first difference data are obtained from the first data table, the second difference data are obtained from the second data table, and the similarity between the first difference data and the second difference data is calculated. The similarity is thus computed automatically and does not need to be judged through manual experience. Further, when the similarity meets a preset condition, the difference data which fails to be matched in the first difference data and the second difference data is determined, and the field corresponding to the difference data which fails to be matched is determined according to the first data table and the second data table. Because only the difference data whose similarity meets the preset condition are used for judging the data difference type, the preset condition selects the difference data that are relevant to the actual business analysis and reduces the amount of calculation. Finally, the data difference types of the first difference data and the second difference data are obtained according to the corresponding fields. The difference data are thereby analyzed automatically, the data difference type is obtained without relying on manual experience, the reason for the difference can be determined from the data difference type, and the accuracy of the difference reasons identified in account checking is improved.
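The mapping from a mismatched field to a data difference type can be illustrated with a simple lookup. The table `DIFFERENCE_TYPES`, its keys and the fallback label are assumptions made for this sketch; the application does not specify how the mapping is stored.

```python
# Hypothetical lookup from the kind of field to the data difference type.
DIFFERENCE_TYPES = {
    "amount": "amounts of the two reconciliation parties differ",
    "date": "times of the two reconciliation parties differ",
    "order_no": "order numbers of the two reconciliation parties differ",
}

def difference_types(fields):
    """Map each mismatched field to a data difference type."""
    return [DIFFERENCE_TYPES.get(field, "unclassified difference") for field in fields]

types = difference_types(["amount", "date"])
```

A field that has no entry falls back to a generic label, so that unexpected fields are reported for review and are not silently dropped.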
In an embodiment, the calculating the similarity between the first difference data and the second difference data includes: acquiring a first fingerprint value and a second fingerprint value, wherein the first fingerprint value is obtained by processing first difference data by adopting a SimHash algorithm, and the second fingerprint value is obtained by processing second difference data by adopting the SimHash algorithm; calculating a hamming distance between the first fingerprint value and the second fingerprint value; and obtaining the similarity according to the Hamming distance.
The name SimHash combines Sim (similarity) and Hash. A Hash algorithm maps original data of different lengths to data of a fixed length, which simplifies subsequent calculation and analysis. However, an ordinary Hash algorithm maps the original data in an effectively random manner, so the resulting fixed-length fingerprint carries no characteristic information of the original data and cannot be used for similarity calculation. SimHash is a locality-sensitive Hash algorithm: the mapped fingerprint is correlated with the original data, so that the more similar the original data, the more similar the fingerprints, as shown in fig. 3. The SimHash algorithm is therefore often used by search engines to deduplicate similar web pages, by academic institutions to detect similar documents, and so on.
In this embodiment, the first difference data and the second difference data are respectively processed by a SimHash algorithm to obtain a first fingerprint value and a second fingerprint value. And then, calculating the Hamming distance between the first fingerprint value and the second fingerprint value, and expressing the similarity between the first fingerprint value and the second fingerprint value through the Hamming distance. The smaller the hamming distance, the more similar the fingerprint. And finally, obtaining the similarity of the first difference data and the second difference data according to the Hamming distance.
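The comparison of two fingerprint values can be sketched as below, with fingerprints represented as Python integers. The Hamming distance is the number of bit positions at which the two fingerprints differ. The conversion to a similarity score, `1 - distance / bits`, is only one plausible choice and is an assumption of this sketch; the text states only that the similarity is obtained from the Hamming distance.

```python
def hamming_distance(fp1: int, fp2: int) -> int:
    """Count the bit positions at which two fingerprints differ."""
    return bin(fp1 ^ fp2).count("1")   # XOR leaves a 1 wherever the bits differ

def similarity(fp1: int, fp2: int, bits: int = 64) -> float:
    """Assumed normalization: 1.0 for identical fingerprints, 0.0 for opposite ones."""
    return 1.0 - hamming_distance(fp1, fp2) / bits

# Two 4-bit fingerprints that differ in one position.
d = hamming_distance(0b1011, 0b1001)
s = similarity(0b1011, 0b1001, bits=4)
```

With this normalization a smaller Hamming distance yields a higher similarity, which is consistent with the threshold comparison described for the preset condition.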
The first difference data and the second difference data are processed through a SimHash algorithm, and the difference data of each dimension in the first difference data and the second difference data contribute to a finally generated fingerprint value, so that the finally obtained fingerprint value can represent each dimension characteristic of the original data. The similarity calculation is carried out on the fingerprint values obtained by adopting the SimHash algorithm, and the data dimension reduction can be carried out on the original data, such as the first difference data and the second difference data, so that the original calculation process needing multi-dimension comparison is simplified into two simple data comparisons, and the calculation amount is greatly reduced.
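This description does not give a formula for obtaining the similarity from the Hamming distance. The following is a minimal sketch, assuming a linear mapping from distance to a score between 0 and 1 and a simple threshold check; the names `similarity_from_distance` and `meets_threshold` are illustrative and do not come from the application.

```python
FINGERPRINT_BITS = 64  # fingerprint length used in this description's examples


def similarity_from_distance(distance: int, bits: int = FINGERPRINT_BITS) -> float:
    """Map a Hamming distance to a score in [0, 1]; 1.0 means identical fingerprints.

    The linear mapping 1 - distance / bits is an assumption made for illustration.
    """
    if not 0 <= distance <= bits:
        raise ValueError("distance must lie between 0 and the fingerprint length")
    return 1.0 - distance / bits


def meets_threshold(distance: int, max_distance: int) -> bool:
    """A pair counts as similar when its distance does not exceed the threshold."""
    return distance <= max_distance
```

Under this assumed mapping a smaller Hamming distance always yields a higher score, which matches the statement that a smaller distance means more similar fingerprints.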
In an embodiment, the obtaining the first fingerprint value and the second fingerprint value includes: performing hash calculation on each difference data in the first difference data to obtain a third fingerprint value; modifying the value 0 in the third fingerprint value into the value -1, and multiplying each modified value in the third fingerprint value by the corresponding weight to obtain a first weight fingerprint value; accumulating the position values of the first weight fingerprint value in sequence to obtain a first accumulated value; carrying out binarization processing on the first accumulated value to obtain a first fingerprint value; performing hash calculation on each difference data in the second difference data to obtain a fourth fingerprint value; modifying the value 0 in the fourth fingerprint value into the value -1, and multiplying each modified value in the fourth fingerprint value by the corresponding weight to obtain a second weight fingerprint value; accumulating the position values of the second weight fingerprint value in sequence to obtain a second accumulated value; and carrying out binarization processing on the second accumulated value to obtain a second fingerprint value.
As shown in fig. 4, computing a fingerprint from the original difference data of the reconciliation result with the SimHash algorithm mainly comprises the following steps:
1) Take the fields used by the reconciliation rule as features, with each field value being a feature value. The fields used by the reconciliation rule are fields of the data tables obtained by reconciliation; a field value is the value corresponding to a field and is also the difference data referred to in this application.
2) Compute an ordinary hash for each feature value, obtaining N hash fingerprints (N is the number of feature values), one per feature dimension. Each hash fingerprint has a fixed length, for example 64 bits, and each bit is 0 or 1.
3) Change every 0 bit of each hash fingerprint to -1, and multiply the value at each position by the weight of the corresponding feature dimension, obtaining N weighted fingerprints.
4) Align the N weighted fingerprints position by position and sum them, obtaining an accumulated value with 64 positions, where the value at each position is a positive integer, 0, or a negative integer.
5) Binarize each position of the accumulated value from the previous step: a positive integer becomes 1 and anything else becomes 0. The resulting new 64-bit fingerprint is the SimHash fingerprint of the original data.
This embodiment uses the above method to obtain the first fingerprint value and the second fingerprint value. The third fingerprint value and the fourth fingerprint value each refer to the hash fingerprints obtained in step 2). The first weight fingerprint value refers to the N weighted fingerprints obtained by changing the 0 bits of the third fingerprint value to -1 and then multiplying the value at each position by the weight of the feature dimension. Similarly, the second weight fingerprint value refers to the N weighted fingerprints obtained by changing the 0 bits of the fourth fingerprint value to -1 and then multiplying the value at each position by the weight of the feature dimension.
In the first difference data and the second difference data, the difference data corresponding to each dimension characteristic value respectively contributes to each position of the final fingerprint value, so that the fingerprint value obtained by adopting the SimHash algorithm can represent each dimension characteristic of the original difference data.
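Steps 2) to 5) above can be sketched as follows. This is a minimal illustration, not the application's implementation: truncated SHA-256 is used as the "ordinary hash" because the description does not name one, a missing weight defaults to 1, and the field names in the usage note are hypothetical.

```python
import hashlib
from typing import List, Mapping

BITS = 64  # fingerprint length given as an example in step 2)


def hash64(value: str) -> int:
    """Ordinary 64-bit hash of one feature value (truncated SHA-256; an assumption)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(features: Mapping[str, str], weights: Mapping[str, int]) -> int:
    """Hash each feature value, map 0 bits to -1 and weight, accumulate, binarize."""
    totals: List[int] = [0] * BITS
    for field, value in features.items():
        fingerprint = hash64(value)        # step 2: one hash fingerprint per feature
        weight = weights.get(field, 1)     # default weight of 1 is an assumption
        for position in range(BITS):
            bit = (fingerprint >> position) & 1
            # step 3: a 0 bit counts as -1, scaled by the feature weight;
            # step 4: the weighted values are summed position by position
            totals[position] += weight if bit else -weight
    result = 0
    for position, total in enumerate(totals):
        if total > 0:                      # step 5: positive becomes 1, otherwise 0
            result |= 1 << position
    return result
```

For example, `simhash({"order_no": "A1001", "amount": "100.00"}, {"order_no": 2, "amount": 1})` returns a 64-bit integer. With a single feature of weight 1 the result equals that feature's hash, and a feature whose weight exceeds the sum of all other weights determines every bit, which illustrates how the weights control each dimension's contribution.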
In one embodiment, the calculating the hamming distance between the first fingerprint value and the second fingerprint value includes: detecting whether the data at the same position in the first fingerprint value and the second fingerprint value are the same; counting the number of different data; the hamming distance is determined by the number.
The calculation of the Hamming distance is shown in fig. 5 and can be summarized as follows: compare the two fingerprints bit by bit and count the positions whose values differ; that count is the Hamming distance between the two fingerprints. In this embodiment, the data at each position of the first fingerprint value is compared with the data at the same position of the second fingerprint value, the number of positions that differ is counted, and the Hamming distance is determined from that number. The similarity of the first difference data and the second difference data can then be determined from the Hamming distance.
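The bit-by-bit comparison can be sketched in two equivalent ways: a literal position-wise comparison of bit strings, and an XOR followed by a count of 1 bits when the fingerprints are held as integers. Both function names are illustrative.

```python
def hamming_distance_bitwise(fp_a: str, fp_b: str) -> int:
    """Compare two equal-length bit strings position by position."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    return sum(1 for bit_a, bit_b in zip(fp_a, fp_b) if bit_a != bit_b)


def hamming_distance(fp_a: int, fp_b: int) -> int:
    """Same result for integer fingerprints: XOR marks the differing positions."""
    return bin(fp_a ^ fp_b).count("1")
```

For the bit strings "10110" and "10011" both functions give 2, since the third and fifth positions differ.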
In one embodiment, before the step of calculating the hamming distance between the first fingerprint value and the second fingerprint value, the method further includes: generating a first fingerprint value of the first difference data, and counting a first number of specific bit values in the first fingerprint value; carrying out fingerprint segmentation on the first fingerprint value to obtain a plurality of sections of first sub-fingerprint values; generating a second fingerprint value of the second difference data, and counting a second number of specific bit values in the second fingerprint value; and carrying out fingerprint segmentation on the second fingerprint value to obtain a plurality of sections of second sub-fingerprint values. The calculating the hamming distance between the first fingerprint value and the second fingerprint value includes: when it is detected that the first difference data matches the second difference data based on the first sub-fingerprint value and the first number and the second sub-fingerprint value and the second number, a hamming distance of the first fingerprint value and the second fingerprint value is calculated.
Bitwise operations are very fast on a computer, so a single Hamming distance calculation is very efficient. However, if the Hamming distance is calculated exhaustively for all pairs of data, the amount of calculation is the product of the difference data volumes of the two reconciliation parties. For example, if party A and party B each have 1 million pieces of difference data, the total is 1 million times 1 million, i.e., 1 trillion calculations; even if a single-core computer could calculate 1 million Hamming distances per second, it would still take about 11.574 days to finish. It is therefore necessary to reduce the amount of calculation so that useful Hamming distances are obtained within an acceptable time.
This embodiment reduces the amount of calculation by combining fingerprint segmentation with a counting range.
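The estimate above can be checked with a few lines of arithmetic. The variable names are illustrative; the figures are the ones stated in the example.

```python
records_a = 1_000_000        # difference data of party A
records_b = 1_000_000        # difference data of party B
rate_per_second = 1_000_000  # Hamming distances per second on one core

comparisons = records_a * records_b      # exhaustive pairwise comparison
seconds = comparisons / rate_per_second
days = seconds / 86_400                  # 86,400 seconds per day

print(comparisons, round(days, 3))
```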
Fingerprint segmentation:
Placing 10 apples in 9 drawers, however they are arranged, leaves at least one drawer with more than 1 apple; this is the drawer principle (pigeonhole principle). Similarly, assume the expected maximum Hamming distance, i.e., the threshold, is n (0 < n < 64). If the 64 bits of each fingerprint value obtained with the SimHash algorithm are split into (n+1) consecutive segments, then whenever two fingerprint values meet the similarity requirement, i.e., their Hamming distance is less than or equal to n, at least 1 of their corresponding segments is identical, as shown in fig. 6.
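A minimal sketch of the segmentation follows. The description only covers the case where 64 divides evenly into (n+1) segments; giving the leftmost segments one extra bit when it does not is an assumption made here, and the function names are illustrative.

```python
from typing import List

BITS = 64


def split_fingerprint(fingerprint: int, max_distance: int) -> List[int]:
    """Split a 64-bit fingerprint, left to right, into (max_distance + 1) segments."""
    count = max_distance + 1
    base, extra = divmod(BITS, count)
    segments = []
    remaining = BITS
    for index in range(count):
        width = base + (1 if index < extra else 0)  # uneven-split rule is an assumption
        remaining -= width
        segments.append((fingerprint >> remaining) & ((1 << width) - 1))
    return segments


def shares_a_segment(fp_a: int, fp_b: int, max_distance: int) -> bool:
    """True when at least one pair of corresponding segments is identical."""
    pairs = zip(split_fingerprint(fp_a, max_distance),
                split_fingerprint(fp_b, max_distance))
    return any(seg_a == seg_b for seg_a, seg_b in pairs)
```

With a threshold of 3, two fingerprints that differ in only three bit positions can spoil at most three of the four segments, so `shares_a_segment` returns True for them; a pair that shares no segment cannot be within the threshold and can be skipped.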
Counting range:
The 64 bits of each fingerprint value are counted to obtain the number of positions whose value is 1, referred to here as the number of 1 bits. Assuming that the number of 1 bits of a certain fingerprint value is n (0 <= n <= 64) and the expected maximum Hamming distance threshold is d, the number of 1 bits of any other fingerprint value that can meet the similarity requirement with it must lie in the range [max(n-d, 0), min(n+d, 64)]. For example, if the number of 1 bits of a certain fingerprint value is 22 and the Hamming distance threshold is 5, then the number of 1 bits of a fingerprint value that meets the similarity requirement must be in the range of 17 to 27. In fact, the counting range is at root still an application of the drawer principle.
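The counting range can be sketched as below. It relies on the fact that each differing bit position changes the number of 1 bits by at most one, so two fingerprints within distance d cannot differ in their 1-bit counts by more than d. The function names are illustrative.

```python
from typing import Tuple

BITS = 64


def count_one_bits(fingerprint: int) -> int:
    """Number of positions whose value is 1."""
    return bin(fingerprint).count("1")


def one_bit_range(ones: int, max_distance: int) -> Tuple[int, int]:
    """Inclusive range [max(n - d, 0), min(n + d, 64)] of admissible 1-bit counts."""
    return max(ones - max_distance, 0), min(ones + max_distance, BITS)


def may_be_similar(fp_a: int, fp_b: int, max_distance: int) -> bool:
    """Cheap pre-filter: False means the pair cannot be within the threshold."""
    low, high = one_bit_range(count_one_bits(fp_a), max_distance)
    return low <= count_one_bits(fp_b) <= high
```

`one_bit_range(22, 5)` returns `(17, 27)`, matching the example in the text. A True result from `may_be_similar` is only a necessary condition; the actual Hamming distance still has to be calculated for the surviving pairs.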
According to the above principles, in this embodiment the number of 1 bits of each first fingerprint value and each second fingerprint value is counted, and each fingerprint value is split into several segments according to the rule. One party's fingerprints are built into an index mapping structure whose key is "fingerprint segment content_number of 1 bits of the original complete fingerprint" and whose value is the original complete fingerprint, as shown in fig. 7. The other party acts as the driving side: each of its complete fingerprints is traversed, each of its fingerprint segments is taken in turn and combined with the numbers of 1 bits that may meet the similarity requirement to form search keys, each search key is looked up in the index mapping structure to find the corresponding complete fingerprint values, and the Hamming distance to each fingerprint found is calculated.
In this embodiment, the first fingerprint value is segmented in the above manner to obtain multiple first sub-fingerprint values, and the number of specific bit values in the first fingerprint value is counted to obtain the first number; the specific bit value may be 1. The multiple second sub-fingerprint values and the second number are obtained in the same way. Fingerprint segmentation of the first fingerprint value includes segmenting it according to the Hamming distance threshold, and fingerprint segmentation of the second fingerprint value likewise includes segmenting it according to the Hamming distance threshold. Whether the first difference data matches the second difference data is then detected from the first sub-fingerprint values and the first number together with the second sub-fingerprint values and the second number; if they match, the Hamming distance between the first fingerprint value and the second fingerprint value is calculated. This reduces the amount of calculation performed by the system.
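The index mapping and the lookup by the driving side can be sketched as follows. This is an illustration under stated assumptions: keys take the form "segment_count" with the segment written in decimal, the index is an in-memory dictionary, and the function names are hypothetical. The threshold is assumed to satisfy 0 < n < 64 as in the text.

```python
from typing import Dict, Iterable, List, Set

BITS = 64


def one_bits(fingerprint: int) -> int:
    return bin(fingerprint).count("1")


def segment_values(fingerprint: int, max_distance: int) -> List[int]:
    """Cut the 64-bit string into (max_distance + 1) consecutive pieces."""
    text = format(fingerprint, f"0{BITS}b")
    count = max_distance + 1
    cuts = [i * BITS // count for i in range(count + 1)]
    return [int(text[start:end], 2) for start, end in zip(cuts, cuts[1:])]


def build_index(fingerprints: Iterable[int], max_distance: int) -> Dict[str, Set[int]]:
    """Indexed side: key 'segment_count' maps to the complete fingerprints."""
    index: Dict[str, Set[int]] = {}
    for fp in fingerprints:
        for segment in segment_values(fp, max_distance):
            index.setdefault(f"{segment}_{one_bits(fp)}", set()).add(fp)
    return index


def find_similar(fp: int, index: Dict[str, Set[int]], max_distance: int) -> List[int]:
    """Driving side: build search keys, collect candidates, verify by Hamming distance."""
    ones = one_bits(fp)
    low, high = max(ones - max_distance, 0), min(ones + max_distance, BITS)
    candidates: Set[int] = set()
    for segment in segment_values(fp, max_distance):
        for count in range(low, high + 1):
            candidates |= index.get(f"{segment}_{count}", set())
    return sorted(c for c in candidates if one_bits(c ^ fp) <= max_distance)
```

Only fingerprints that share a segment value and fall inside the counting range become candidates, so the Hamming distance is calculated for a small subset rather than for every pair. Candidates that share a key by coincidence are removed by the final distance check.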
In one embodiment, the step of calculating the hamming distance between the first fingerprint value and the second fingerprint value when the first difference data is detected to match the second difference data according to the first sub-fingerprint value and the first number and the second sub-fingerprint value and the second number is further preceded by: combining the first sub-fingerprint value and the first number to obtain first combined data; combining the second sub-fingerprint value and the second number to obtain second combined data; a threshold value for the hamming distance is determined. The detecting that the first difference data matches the second difference data according to the first sub-fingerprint value and the first number and the second sub-fingerprint value and the second number includes: determining third combined data according to the first combined data and a threshold value of the Hamming distance; when the second combined data and the third combined data are the same, it is detected that the first difference data matches the second difference data.
In this embodiment, the first sub-fingerprint value and the first number are combined to construct the first combined data, and the second sub-fingerprint value and the second number are combined to construct the second combined data. For example, suppose the 64-bit binary first fingerprint value obtained by the SimHash algorithm is 0010101010101010001001101001000100100111001010101111010000110110 and the Hamming distance threshold is 3. The first fingerprint value then needs to be split bitwise from left to right into 4 segments of 16 bits each, namely "0010101010101010" (bits 1 to 16), "0010011010010001" (bits 17 to 32), "0010011100101010" (bits 33 to 48) and "1111010000110110" (bits 49 to 64), and its number of 1 bits is 29. Converting the four segments to decimal and combining each with the number of 1 bits gives four first combined data: "10922_29", "9873_29", "10026_29", "62518_29". Similarly, assume that the four segments of the second fingerprint value are "10922", "4673", "26702", "8741" in decimal and that its number of 1 bits is 30. The four second combined data are then "10922_30", "4673_30", "26702_30", "8741_30".
If the Hamming distance between the first fingerprint value and the second fingerprint value is not to exceed 3, the number of 1 bits of a fingerprint value matching the second fingerprint value must lie within the range (30-3) to (30+3), i.e., [27, 33]. The third combined data are then determined according to the first combined data and the Hamming distance threshold, and include "10922_27", "10922_28", "10922_29", ..., "8741_33". When any of the second combined data is the same as any of the third combined data, it is detected that the first difference data matches the second difference data. This avoids computing over the Cartesian product of the two reconciliation parties' fingerprint value sets and reduces the amount of calculation performed by the system.
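The figures in this example can be reproduced with a short sketch. The segment values and the 1-bit count of 30 for the second fingerprint are taken as given in the text; treating the expanded keys "10922_27" through "8741_33" as search keys that are compared against the four keys of the first fingerprint is one reading of the example, and the function names are illustrative.

```python
from typing import List

FIRST = "0010101010101010001001101001000100100111001010101111010000110110"
THRESHOLD = 3


def combined_data(bits: str, threshold: int) -> List[str]:
    """Keys 'segment_count'; assumes len(bits) divides evenly by threshold + 1."""
    width = len(bits) // (threshold + 1)
    ones = bits.count("1")
    return [f"{int(bits[i:i + width], 2)}_{ones}" for i in range(0, len(bits), width)]


def search_keys(segments: List[int], ones: int, threshold: int, length: int = 64) -> List[str]:
    """Expand each segment over every admissible 1-bit count."""
    low, high = max(ones - threshold, 0), min(ones + threshold, length)
    return [f"{segment}_{count}" for segment in segments for count in range(low, high + 1)]


first_keys = combined_data(FIRST, THRESHOLD)
expanded = search_keys([10922, 4673, 26702, 8741], 30, THRESHOLD)
matches = set(first_keys) & set(expanded)
print(first_keys, len(expanded), matches)
```

The expansion yields 4 segments times 7 counts, i.e., 28 keys, and the only key shared with the first fingerprint is "10922_29". A shared key marks the pair as a candidate; the full Hamming distance is still calculated afterwards to confirm it.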
In one embodiment, after obtaining the data difference types of the first difference data and the second difference data according to the corresponding fields, the method further includes: acquiring a plurality of data difference types of a first data table and a second data table; classifying the data difference types to obtain a plurality of difference types of account checking; calculating the ratio of each difference category according to the difference data corresponding to each difference category; and graphically displaying each difference category and the proportion of each difference category.
In this embodiment, the server obtains multiple data difference types of the first data table and the second data table by using the above embodiments, and a specific obtaining flow is shown in fig. 8. Referring to the flowchart shown in fig. 8, the server obtaining a plurality of data difference types of the first data table and the second data table includes the steps of:
1) Obtain the difference data in the first data table and the second data table
Read the difference data of both reconciliation parties, party A and party B, from the data store. Typical data stores include relational databases, Hive databases, MongoDB databases, and the like.
2) Calculate the characteristic fingerprint, fingerprint segments and bit count of the difference data of party A and party B
For the original difference data sets of both parties read in the previous step, calculate the fingerprint corresponding to each piece of difference data, obtaining the fingerprint data sets for the difference data of party A and party B.
To optimize the subsequent Hamming distance calculation, fingerprint segmentation and bit counting are then performed on each fingerprint in the fingerprint data sets. The bit count may be the number of 1 bits.
3) One party builds the index mapping and the other party builds the search key set
For each fingerprint value in one party's fingerprint data set, build the index mapping using the combination of the fingerprint segment value and the bit count as the key.
4) Calculate the Hamming distance of the fingerprint values
Traverse each fingerprint value of the other party in a loop, construct the appropriate keys from its fingerprint segments and bit count, and look them up in the index mapping built in the previous step. If related fingerprint values of the indexed party are found, calculate the Hamming distance to each one and screen the results against the threshold.
5) Form the similarity reference result
Summarize the pairs of similar difference data obtained in the previous step, group pairs that differ on the same fields into the same category, calculate the count and proportion of each category, and produce a chart to assist analysis.
That is, the fingerprint value of each piece of difference data in each data table is calculated, and each fingerprint value is processed by fingerprint segmentation and bit counting to obtain its segmented fingerprint values and its number of 1 bits. An index mapping is built for the fingerprint values of the first data table, keyed by the combination of segmented fingerprint value and bit count. Each fingerprint value in the second data table is then traversed in a loop, the Hamming distance between each traversed fingerprint value and the fingerprint values found in the first data table is calculated, and the similar difference data of the first data table and the second data table are obtained according to the Hamming distance. A data difference type is determined for each pair of similar difference data. Data difference types that involve the same fields are grouped into the same category, giving a plurality of reconciliation difference categories. The count and proportion of each difference category are then calculated, and the difference categories with their counts and proportions are displayed graphically as a chart.
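The final grouping step can be sketched as below. It assumes each similar pair is available as two field-to-value mappings and that a difference category is identified by the set of fields whose values differ; the function names and field names are hypothetical, and charting is left out.

```python
from collections import Counter
from typing import Dict, List, Mapping, Tuple


def differing_fields(record_a: Mapping[str, str], record_b: Mapping[str, str]) -> Tuple[str, ...]:
    """Fields whose values differ between the two records of a similar pair."""
    fields = sorted(set(record_a) | set(record_b))
    return tuple(f for f in fields if record_a.get(f) != record_b.get(f))


def category_ratios(categories: List[Tuple[str, ...]]) -> Dict[Tuple[str, ...], Tuple[int, float]]:
    """Count and proportion of each difference category."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {category: (n, n / total) for category, n in counts.items()}
```

For instance, if three similar pairs differ only on a hypothetical `amount` field and one differs only on `date`, `category_ratios` reports proportions of 0.75 and 0.25, which are the values a pie or bar chart would display.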
For example, a specific embodiment is as follows:
1) Configuration platform
A relational database (Oracle/DB2/SQLServer/MySQL and the like) is used for persistent data storage and holds data such as the location of the reconciliation difference data to be analyzed, the reconciliation association matching rules, and the maximum Hamming distance threshold.
The platform can be implemented with a B/S or C/S architecture. The server-side application is implemented with a framework and technology stack conforming to the JEE technical specification, such as SpringBoot + MyBatis; other technology stacks such as .NET/C# can also be used.
To prevent a single point of failure, the application can be deployed as a cluster with software or hardware load balancing through Apache/Nginx/F5, etc. Considering cost and importance, software load balancing with Apache/Nginx is generally used.
2) Computing platform
To obtain high availability and horizontal scalability, it is proposed to build the computing platform on a distributed computing cluster, for example by using a big data cluster that already exists in the enterprise, with YARN (Apache Hadoop YARN, Yet Another Resource Negotiator) as the resource management and coordination component of the computing platform. Spark is used as the programming framework and computing engine of the actual reconciliation task application, and Hive/MongoDB is used as persistent storage for the similarity analysis data.
3) Difference data and simulation results
The simulation test data volume, key indicators and result samples are shown in fig. 9. The resulting graph is shown in FIG. 10.
The above method for processing account checking difference data innovatively applies similarity calculation from the machine learning field to difference analysis in the financial account checking field and supports parameter tuning tailored to each specific account checking instance. The method is simple to implement, optimizes the original algorithm to improve calculation speed, and ultimately produces a friendly, intuitive and valuable result to assist analysis. This greatly reduces the difficulty of difference analysis, changes the situation in which business and finance staff could only investigate manually and blindly, improves the efficiency and effectiveness of analyzing the causes of differences, and saves substantial labor cost.
It should be understood that, although the steps in the flowchart are shown in sequence as indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the figures may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
The present application further provides a device for processing account checking difference data. As shown in fig. 11, the device includes an acquisition module 1102, a calculating module 1104, a first determining module 1106, a second determining module 1108, and an obtaining module 1110. The acquisition module 1102 is configured to identify a first data table and a second data table obtained by reconciliation, obtain first difference data from the first data table, and obtain second difference data from the second data table; the calculating module 1104 is configured to calculate a similarity between the first difference data and the second difference data; the first determining module 1106 is configured to determine, when the similarity satisfies a preset condition, difference data that fails to be matched in the first difference data and the second difference data; the second determining module 1108 is configured to determine, according to the first data table and the second data table, a field corresponding to the difference data that fails to be matched; and the obtaining module 1110 is configured to obtain a data difference type of the first difference data and the second difference data according to the corresponding field.
In one embodiment, calculating the similarity of the first difference data and the second difference data comprises: acquiring a first fingerprint value and a second fingerprint value, wherein the first fingerprint value is obtained by processing first difference data by adopting a SimHash algorithm, and the second fingerprint value is obtained by processing second difference data by adopting the SimHash algorithm; calculating a hamming distance between the first fingerprint value and the second fingerprint value; and obtaining the similarity according to the Hamming distance.
In one embodiment, obtaining the first fingerprint value and the second fingerprint value comprises: performing hash calculation on each difference data in the first difference data to obtain a third fingerprint value; modifying the value 0 in the third fingerprint value into the value -1, and multiplying each modified value in the third fingerprint value by the corresponding weight to obtain a first weight fingerprint value; accumulating the position values of the first weight fingerprint value in sequence to obtain a first accumulated value; carrying out binarization processing on the first accumulated value to obtain a first fingerprint value; performing hash calculation on each difference data in the second difference data to obtain a fourth fingerprint value; modifying the value 0 in the fourth fingerprint value into the value -1, and multiplying each modified value in the fourth fingerprint value by the corresponding weight to obtain a second weight fingerprint value; accumulating the position values of the second weight fingerprint value in sequence to obtain a second accumulated value; and carrying out binarization processing on the second accumulated value to obtain a second fingerprint value.
In one embodiment, calculating a hamming distance for the first fingerprint value and the second fingerprint value includes: detecting whether the data at the same position in the first fingerprint value and the second fingerprint value are the same; counting the number of different data; the hamming distance is determined by the number.
In one embodiment, the device for processing account checking difference data further comprises a processing module. The processing module is used for generating a first fingerprint value of the first difference data, counting a first number of specific bit values in the first fingerprint value, carrying out fingerprint segmentation on the first fingerprint value to obtain a plurality of sections of first sub-fingerprint values, generating a second fingerprint value of the second difference data, counting a second number of specific bit values in the second fingerprint value, and carrying out fingerprint segmentation on the second fingerprint value to obtain a plurality of sections of second sub-fingerprint values; calculating a hamming distance for the first fingerprint value and the second fingerprint value, comprising: when it is detected that the first difference data matches the second difference data based on the first sub-fingerprint value and the first number and the second sub-fingerprint value and the second number, a hamming distance of the first fingerprint value and the second fingerprint value is calculated.
In one embodiment, the processing device of the account checking difference data further comprises a combination module. The combination module is used for combining the first sub-fingerprint value and the first number to obtain first combination data, combining the second sub-fingerprint value and the second number to obtain second combination data, and determining a threshold value of the Hamming distance; detecting a match of the first difference data and the second difference data based on the first sub-fingerprint value and the first number and the second sub-fingerprint value and the second number, comprising: determining third combined data according to the first combined data and a threshold value of the Hamming distance; when the second combined data and the third combined data are the same, it is detected that the first difference data matches the second difference data.
In one embodiment, after obtaining the data difference types of the first difference data and the second difference data according to the corresponding fields, the method further includes: acquiring a plurality of data difference types of a first data table and a second data table; classifying the data difference types to obtain a plurality of difference types of account checking; calculating the ratio of each difference category according to the difference data corresponding to each difference category; and graphically displaying each difference category and the proportion of each difference category.
For specific limitations of the device for processing account checking difference data, reference may be made to the above limitations of the method for processing account checking difference data, which are not repeated here. Each module in the device may be implemented wholly or partly in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and perform the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 12. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is adapted to interface with a database to read the difference data from the database. The computer program is executed by a processor to implement a method of processing reconciliation difference data.
Those skilled in the art will appreciate that the architecture shown in fig. 12 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or fewer components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program: identifying a first data table and a second data table obtained by account checking, acquiring first difference data from the first data table, and acquiring second difference data from the second data table; calculating the similarity of the first difference data and the second difference data; when the similarity meets a preset condition, determining the difference data which fails to be matched in the first difference data and the second difference data; determining a field corresponding to the difference data which fails to be matched according to the first data table and the second data table; and obtaining the data difference types of the first difference data and the second difference data according to the corresponding fields.
In one embodiment, when the processor executes the computer program to implement the step of calculating the similarity between the first difference data and the second difference data, the following steps are specifically implemented: acquiring a first fingerprint value and a second fingerprint value, wherein the first fingerprint value is obtained by processing first difference data by adopting a SimHash algorithm, and the second fingerprint value is obtained by processing second difference data by adopting the SimHash algorithm; calculating a hamming distance between the first fingerprint value and the second fingerprint value; and obtaining the similarity according to the Hamming distance.
In one embodiment, when the processor executes the computer program to implement the above step of obtaining the first fingerprint value and the second fingerprint value, the following steps are specifically implemented: performing hash calculation on each difference data in the first difference data to obtain a third fingerprint value; modifying the value 0 in the third fingerprint value into the value -1, and multiplying each modified value in the third fingerprint value by the corresponding weight to obtain a first weight fingerprint value; accumulating the position values of the first weight fingerprint value in sequence to obtain a first accumulated value; carrying out binarization processing on the first accumulated value to obtain a first fingerprint value; performing hash calculation on each difference data in the second difference data to obtain a fourth fingerprint value; modifying the value 0 in the fourth fingerprint value into the value -1, and multiplying each modified value in the fourth fingerprint value by the corresponding weight to obtain a second weight fingerprint value; accumulating the position values of the second weight fingerprint value in sequence to obtain a second accumulated value; and carrying out binarization processing on the second accumulated value to obtain a second fingerprint value.
In one embodiment, when the processor executes the computer program to implement the above step of calculating the Hamming distance between the first fingerprint value and the second fingerprint value, the following steps are specifically implemented: detecting whether the data at the same position in the first fingerprint value and the second fingerprint value are the same; counting the number of positions at which the data differ; and determining the Hamming distance according to that number.
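The position-by-position comparison can be sketched as below, with fingerprints written as bit strings. The text only says that the similarity is obtained from the Hamming distance; the normalization `1 - distance / length` used in `similarity` is an assumption chosen for illustration, and both function names are hypothetical.

```python
def hamming_distance(fp_a: str, fp_b: str) -> int:
    """Count the positions at which two equal-length bit strings differ."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    return sum(1 for a, b in zip(fp_a, fp_b) if a != b)


def similarity(fp_a: str, fp_b: str) -> float:
    """Map the Hamming distance to [0, 1]; this normalization is an assumption."""
    if not fp_a:
        raise ValueError("fingerprints must not be empty")
    return 1.0 - hamming_distance(fp_a, fp_b) / len(fp_a)
```

For example, `hamming_distance("10110", "10011")` is 2, because the strings differ at the third and fifth positions only.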
In one embodiment, the processor, when executing the computer program, further performs the steps of: generating a first fingerprint value of the first difference data, counting a first number of specific bit values in the first fingerprint value, and performing fingerprint segmentation on the first fingerprint value to obtain a plurality of first sub-fingerprint values; generating a second fingerprint value of the second difference data, counting a second number of the specific bit values in the second fingerprint value, and performing fingerprint segmentation on the second fingerprint value to obtain a plurality of second sub-fingerprint values. When the processor executes the computer program to implement the step of calculating the Hamming distance between the first fingerprint value and the second fingerprint value, the following step is specifically implemented: when it is detected, according to the first sub-fingerprint values and the first number and the second sub-fingerprint values and the second number, that the first difference data matches the second difference data, calculating the Hamming distance between the first fingerprint value and the second fingerprint value.
In one embodiment, the processor, when executing the computer program, further performs the steps of: combining the first sub-fingerprint values and the first number to obtain first combined data; combining the second sub-fingerprint values and the second number to obtain second combined data; and determining a threshold value of the Hamming distance. When the processor executes the computer program to implement the step of detecting that the first difference data matches the second difference data according to the first sub-fingerprint values and the first number and the second sub-fingerprint values and the second number, the following steps are specifically implemented: determining third combined data according to the first combined data and the threshold value of the Hamming distance; and when the second combined data and the third combined data are the same, detecting that the first difference data matches the second difference data.
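The text does not spell out how the combined data are built, so the following is only one possible reading of the pre-filter. It assumes the "specific bit value" is 1, that the fingerprint is cut into threshold + 1 segments, and that the "third combined data" is the set of keys obtained by widening the 1-bit count by the threshold. Under those assumptions, two fingerprints within the Hamming threshold must share at least one identical segment (pigeonhole principle) and their 1-bit counts can differ by at most the threshold. All names are hypothetical.

```python
from typing import List, Set, Tuple

BITS = 64  # assumed fingerprint width
Key = Tuple[int, int, int]  # (segment index, segment value, count of 1 bits)


def popcount(fp: int) -> int:
    """Count the 1 bits in a fingerprint."""
    return bin(fp).count("1")


def segments(fp: int, parts: int, bits: int = BITS) -> List[int]:
    """Split a fingerprint into `parts` contiguous sub-fingerprints."""
    out = []
    for i in range(parts):
        start, end = i * bits // parts, (i + 1) * bits // parts
        out.append((fp >> start) & ((1 << (end - start)) - 1))
    return out


def combined_keys(fp: int, threshold: int, bits: int = BITS) -> Set[Key]:
    """Combine each sub-fingerprint with the fingerprint's 1-bit count."""
    count = popcount(fp)
    return {(i, seg, count) for i, seg in enumerate(segments(fp, threshold + 1, bits))}


def candidate_keys(fp: int, threshold: int, bits: int = BITS) -> Set[Key]:
    """Widen the 1-bit count by +/- threshold to get every key a near match could have."""
    count = popcount(fp)
    low, high = max(0, count - threshold), min(bits, count + threshold)
    return {
        (i, seg, c)
        for i, seg in enumerate(segments(fp, threshold + 1, bits))
        for c in range(low, high + 1)
    }


def may_match(fp_a: int, fp_b: int, threshold: int, bits: int = BITS) -> bool:
    """Cheap necessary condition for Hamming distance <= threshold."""
    return not candidate_keys(fp_a, threshold, bits).isdisjoint(
        combined_keys(fp_b, threshold, bits)
    )
```

Only pairs for which `may_match` returns `True` need the full Hamming distance calculation, which is the saving the embodiment appears to aim for; the check can still pass for pairs whose distance exceeds the threshold, so it filters candidates rather than replacing the distance computation.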
In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring a plurality of data difference types of the first data table and the second data table; classifying the data difference types to obtain a plurality of difference categories of the account checking; calculating the proportion of each difference category according to the difference data corresponding to each difference category; and graphically displaying each difference category and its proportion.
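As a rough sketch of the classification and proportion steps, the snippet below groups difference types into categories and computes each category's share. The type names, the `CATEGORY_OF` mapping, and the text bar chart standing in for the graphical display are all hypothetical; the source does not name concrete types or a charting method.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical mapping from fine-grained difference types to reconciliation categories.
CATEGORY_OF = {
    "amount_mismatch": "amount",
    "fee_mismatch": "amount",
    "status_mismatch": "status",
    "missing_in_second_table": "missing record",
}


def category_proportions(difference_types: Iterable[str]) -> Dict[str, float]:
    """Classify difference types and return each category's share of the total."""
    counts = Counter(CATEGORY_OF.get(t, "other") for t in difference_types)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}


def render_bar_chart(proportions: Dict[str, float], width: int = 20) -> str:
    """Render a plain-text bar chart, largest category first."""
    rows = sorted(proportions.items(), key=lambda kv: -kv[1])
    return "\n".join(
        f"{cat:<15} {'#' * round(share * width)} {share:.0%}" for cat, share in rows
    )
```

In practice the proportions would more likely feed a pie or bar chart in a reporting front end; the text rendering here only keeps the example self-contained.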
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of: identifying a first data table and a second data table obtained by account checking, acquiring first difference data from the first data table, and acquiring second difference data from the second data table; calculating the similarity of the first difference data and the second difference data; when the similarity meets a preset condition, determining the difference data which fails to be matched in the first difference data and the second difference data; determining a field corresponding to the difference data which fails to be matched according to the first data table and the second data table; and obtaining the data difference types of the first difference data and the second difference data according to the corresponding fields.
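The last two steps of this method (finding the fields whose values fail to match, then deriving difference types from those fields) might look like the sketch below. It assumes each item of difference data is a row represented as a dictionary keyed by field name, and that a difference type is simply named after the mismatched field; neither detail comes from the source.

```python
from typing import Dict, List


def mismatched_fields(row_a: Dict[str, str], row_b: Dict[str, str]) -> List[str]:
    """Return the field names whose values differ between two paired rows."""
    fields = sorted(set(row_a) | set(row_b))
    return [f for f in fields if row_a.get(f) != row_b.get(f)]


def difference_types(row_a: Dict[str, str], row_b: Dict[str, str]) -> List[str]:
    """Derive one difference type per mismatched field (naming scheme is assumed)."""
    return [f"{field}_mismatch" for field in mismatched_fields(row_a, row_b)]
```

For two rows that share `order_id` and `status` but carry amounts of `100.00` and `100.50`, `difference_types` returns `["amount_mismatch"]`.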
In one embodiment, when the computer program is executed by the processor to implement the step of calculating the similarity between the first difference data and the second difference data, the following steps are specifically implemented: acquiring a first fingerprint value and a second fingerprint value, wherein the first fingerprint value is obtained by processing the first difference data with a SimHash algorithm, and the second fingerprint value is obtained by processing the second difference data with the SimHash algorithm; calculating a Hamming distance between the first fingerprint value and the second fingerprint value; and obtaining the similarity according to the Hamming distance.
In one embodiment, when the computer program is executed by the processor to implement the above step of acquiring the first fingerprint value and the second fingerprint value, the following steps are specifically implemented: performing hash calculation on each item of difference data in the first difference data to obtain a third fingerprint value; changing each value 0 in the third fingerprint value to the value -1, and multiplying each modified value in the third fingerprint value by the corresponding weight to obtain a first weight fingerprint value; accumulating, position by position, the values of the first weight fingerprint values to obtain a first accumulated value; binarizing the first accumulated value to obtain the first fingerprint value; performing hash calculation on each item of difference data in the second difference data to obtain a fourth fingerprint value; changing each value 0 in the fourth fingerprint value to the value -1, and multiplying each modified value in the fourth fingerprint value by the corresponding weight to obtain a second weight fingerprint value; accumulating, position by position, the values of the second weight fingerprint values to obtain a second accumulated value; and binarizing the second accumulated value to obtain the second fingerprint value.
In one embodiment, when the computer program is executed by the processor to implement the above step of calculating the Hamming distance between the first fingerprint value and the second fingerprint value, the following steps are specifically implemented: detecting whether the data at the same position in the first fingerprint value and the second fingerprint value are the same; counting the number of positions at which the data differ; and determining the Hamming distance according to that number.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: generating a first fingerprint value of the first difference data, counting a first number of specific bit values in the first fingerprint value, and performing fingerprint segmentation on the first fingerprint value to obtain a plurality of first sub-fingerprint values; generating a second fingerprint value of the second difference data, counting a second number of the specific bit values in the second fingerprint value, and performing fingerprint segmentation on the second fingerprint value to obtain a plurality of second sub-fingerprint values. When the computer program is executed by the processor to implement the step of calculating the Hamming distance between the first fingerprint value and the second fingerprint value, the following step is specifically implemented: when it is detected, according to the first sub-fingerprint values and the first number and the second sub-fingerprint values and the second number, that the first difference data matches the second difference data, calculating the Hamming distance between the first fingerprint value and the second fingerprint value.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: combining the first sub-fingerprint values and the first number to obtain first combined data; combining the second sub-fingerprint values and the second number to obtain second combined data; and determining a threshold value of the Hamming distance. When the computer program is executed by the processor to implement the step of detecting that the first difference data matches the second difference data according to the first sub-fingerprint values and the first number and the second sub-fingerprint values and the second number, the following steps are specifically implemented: determining third combined data according to the first combined data and the threshold value of the Hamming distance; and when the second combined data and the third combined data are the same, detecting that the first difference data matches the second difference data.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: acquiring a plurality of data difference types of the first data table and the second data table; classifying the data difference types to obtain a plurality of difference categories of the account checking; calculating the proportion of each difference category according to the difference data corresponding to each difference category; and graphically displaying each difference category and its proportion.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims (10)

1. A method for processing reconciliation difference data, the method comprising:
identifying a first data table and a second data table obtained by account checking, acquiring first difference data from the first data table, and acquiring second difference data from the second data table;
calculating the similarity of the first difference data and the second difference data;
when the similarity meets a preset condition, determining the difference data which fails to be matched in the first difference data and the second difference data;
determining a field corresponding to the difference data which fails to be matched according to the first data table and the second data table;
and obtaining the data difference types of the first difference data and the second difference data according to the corresponding fields.
2. The method of claim 1, wherein calculating the similarity of the first difference data and the second difference data comprises:
acquiring a first fingerprint value and a second fingerprint value, wherein the first fingerprint value is obtained by processing the first difference data by adopting a SimHash algorithm, and the second fingerprint value is obtained by processing the second difference data by adopting the SimHash algorithm;
calculating a Hamming distance between the first fingerprint value and the second fingerprint value;
and obtaining the similarity according to the Hamming distance.
3. The method of claim 2, wherein obtaining the first fingerprint value and the second fingerprint value comprises:
performing hash calculation on each difference data in the first difference data to obtain a third fingerprint value;
modifying each value 0 in the third fingerprint value to the value -1, and multiplying each modified value in the third fingerprint value by a corresponding weight to obtain a first weight fingerprint value;
accumulating the position values of the first weight fingerprint value in sequence to obtain a first accumulated value;
carrying out binarization processing on the first accumulated value to obtain a first fingerprint value;
performing hash calculation on each difference data in the second difference data to obtain a fourth fingerprint value;
modifying each value 0 in the fourth fingerprint value to the value -1, and multiplying each modified value in the fourth fingerprint value by the corresponding weight to obtain a second weight fingerprint value;
accumulating the position values of the second weight fingerprint value in sequence to obtain a second accumulated value;
and carrying out binarization processing on the second accumulated value to obtain the second fingerprint value.
4. The method of claim 2, wherein the calculating the Hamming distance between the first fingerprint value and the second fingerprint value comprises:
detecting whether the data at the same position in the first fingerprint value and the second fingerprint value are the same;
counting the number of different data;
determining the Hamming distance based on the number.
5. The method of claim 2, further comprising:
generating a first fingerprint value of the first difference data, and counting a first number of specific bit values in the first fingerprint value;
carrying out fingerprint segmentation on the first fingerprint value to obtain a plurality of sections of first sub-fingerprint values;
generating a second fingerprint value of the second difference data, and counting a second number of the specific bit values in the second fingerprint value;
carrying out fingerprint segmentation on the second fingerprint value to obtain a plurality of sections of second sub-fingerprint values;
the calculating the Hamming distance of the first fingerprint value and the second fingerprint value includes:
calculating a Hamming distance of the first fingerprint value and the second fingerprint value when it is detected that the first difference data matches the second difference data according to the first sub-fingerprint value and the first number and the second sub-fingerprint value and the second number.
6. The method of claim 5, further comprising:
combining the first sub-fingerprint value and the first number to obtain first combined data;
combining the second sub-fingerprint value and the second number to obtain second combined data;
determining a threshold value of the Hamming distance;
said detecting that said first difference data matches said second difference data based on said first sub-fingerprint value and said first number and said second sub-fingerprint value and said second number comprises:
determining third combined data according to the first combined data and the threshold value of the Hamming distance;
detecting that the first difference data matches the second difference data when the second combined data and the third combined data are the same.
7. The method according to claim 1, wherein after the data difference types of the first difference data and the second difference data are obtained according to the corresponding fields, the method further comprises:
obtaining a plurality of data difference types of the first data table and the second data table;
classifying the data difference types to obtain a plurality of difference categories of the account checking;
calculating the proportion of each difference category according to the difference data corresponding to each difference category;
and graphically displaying the difference categories and the proportions of the difference categories.
8. An apparatus for processing account reconciliation difference data, the apparatus comprising:
an acquisition module, configured to identify a first data table and a second data table obtained by account checking, acquire first difference data from the first data table, and acquire second difference data from the second data table;
a calculating module, configured to calculate a similarity between the first difference data and the second difference data;
a first determining module, configured to determine, when the similarity meets a preset condition, the difference data that fails to be matched in the first difference data and the second difference data;
a second determining module, configured to determine, according to the first data table and the second data table, a field corresponding to the difference data that fails to be matched;
an obtaining module, configured to obtain a data difference type of the first difference data and the second difference data according to the corresponding field.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202110512158.7A 2021-05-11 2021-05-11 Account checking difference data processing method and device, computer equipment and storage medium Pending CN113283973A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110512158.7A CN113283973A (en) 2021-05-11 2021-05-11 Account checking difference data processing method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110512158.7A CN113283973A (en) 2021-05-11 2021-05-11 Account checking difference data processing method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113283973A true CN113283973A (en) 2021-08-20

Family

ID=77278543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110512158.7A Pending CN113283973A (en) 2021-05-11 2021-05-11 Account checking difference data processing method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113283973A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115205023A (en) * 2022-06-24 2022-10-18 平安科技(深圳)有限公司 Bill data monitoring method, device, medium and equipment
US11586599B1 (en) * 2021-11-11 2023-02-21 Bank Of America Corporation Smart data warehouse protocols

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344154A (en) * 2018-08-22 2019-02-15 中国平安人寿保险股份有限公司 Data processing method, device, electronic equipment and storage medium
CN112465631A (en) * 2020-12-16 2021-03-09 深圳乐信软件技术有限公司 Reconciliation difference processing method, system, server and storage medium

Similar Documents

Publication Publication Date Title
CN109598095B (en) Method and device for establishing scoring card model, computer equipment and storage medium
CN108876600B (en) Early warning information pushing method, device, computer equipment and medium
JP7169369B2 (en) Method, system for generating data for machine learning algorithms
CN110009225B (en) Risk assessment system construction method, risk assessment system construction device, computer equipment and storage medium
WO2020015089A1 (en) Identity information risk assessment method and apparatus, and computer device and storage medium
US9367580B2 (en) Method, apparatus and computer program for detecting deviations in data sources
CN113283973A (en) Account checking difference data processing method and device, computer equipment and storage medium
CN109325118B (en) Unbalanced sample data preprocessing method and device and computer equipment
US20180121504A1 (en) Method and database computer system for performing a database query using a bitmap index
US20160147867A1 (en) Information matching apparatus, information matching method, and computer readable storage medium having stored information matching program
CN109062936B (en) Data query method, computer readable storage medium and terminal equipment
CN110990529B (en) Industry detail dividing method and system for enterprises
CN111368867B (en) File classifying method and system and computer readable storage medium
WO2020015140A1 (en) Passenger rating model generation method and apparatus, and computer device and storage medium
CN114841789B (en) Block chain-based auditing and auditing evaluation fault data online editing method and system
CN109716660A (en) Data compression device and method
CN117216239A (en) Text deduplication method, text deduplication device, computer equipment and storage medium
US20230273924A1 (en) Trimming blackhole clusters
US10169418B2 (en) Deriving a multi-pass matching algorithm for data de-duplication
US20080010231A1 (en) Rule processing optimization by content routing using decision trees
US20220091818A1 (en) Data feature processing method and data feature processing apparatus
CN110941952A (en) Method and device for perfecting audit analysis model
CN111460268B (en) Method and device for determining database query request and computer equipment
CN114896955A (en) Data report processing method and device, computer equipment and storage medium
CN113723522B (en) Abnormal user identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210820