CN108804542B

CN108804542B - Method for rapidly acquiring file increment based on memory operation

Info

Publication number: CN108804542B
Application number: CN201810465352.2A
Authority: CN
Inventors: 柴磊; 原伟; 柳彦利; 杨峰; 马章焘; 王立强; 冯剑; 付斐; 郭峰; 刘改琴; 李扬; 刘晓霖
Original assignee: Hebei Godsend High Tech Co ltd
Current assignee: Hebei Godsend High Tech Co ltd
Priority date: 2018-05-16
Filing date: 2018-05-16
Publication date: 2021-12-07
Anticipated expiration: 2038-05-16
Also published as: CN108804542A

Abstract

The invention discloses a method for quickly acquiring file increment based on memory operation. The invention has the advantages of simple structure and strong practicability.

Description

Method for rapidly acquiring file increment based on memory operation

Technical Field

The invention relates to the technical fields classified under G06F17/30, G06F17/00, G06F17 and G06F, and in particular to a method for quickly acquiring file increment based on memory operation.

Background

In an ETL process, obtaining incremental data is a key step. The conventional approach loads the new data into a database and then strips out the increment there. This consumes a large amount of expensive database resources and runs slowly. In addition, the increment-stripping program in the database has to be modified whenever an interface is upgraded, which requires considerable manual intervention and adds development pressure.

Disclosure of Invention

The invention aims to solve the above problems by providing a method for quickly acquiring file increment based on memory operation.

The technical scheme of the invention is a method for rapidly acquiring file increment based on memory operation, in which characteristic values are extracted from the old file row by row and written into a specific data structure in memory, and characteristic values are then extracted from the new file with the same algorithm and looked up in memory. The characteristic-value algorithm and the data structure are designed so as to avoid conflicts.

The incremental method comprises the following steps:

Step one, select the 2 files between which the increment is to be obtained (hereinafter the old file is referred to as the A file, the new file as the B file, and the increment file as the C file).

Step two, construct in memory a 32-fork Tire Tree (a trie) of depth D. The specific depth can be chosen according to the file size or the hardware environment; the recommended depth is 4 to 6.

Step three, take the Hash value of the A file line by line, select D (4 to 6) characters of the Hash value as the Tire Tree index according to the sizes of the memory and the file, and follow the index one level at a time to find the leaf node pointer of the Tire Tree.

Step four, create a text information node at the leaf node, and store in it the Hash value of the line of text and a part of its original text. caHash stores the Hash value; its length H should be as long as the character string produced by the adopted Hash algorithm allows, and 16 to 24 characters is recommended. caMsg stores part of the original text, namely its first M characters; the recommended M is 4 to 8.

Step five, store the Hash values and partial original text of all rows of the whole file into text information nodes in turn. If a conflict occurs, the text information nodes are organized as a linked list, which resolves the conflict and saves some memory. If high operation speed is required and memory is sufficient, the text information nodes can instead be organized as a binary tree in linked representation, at the cost of 10-16% more memory.

Step six, calculate the Hash value of each row of the B file with the same algorithm, and query nodes in the Tire Tree using the same D value (4 to 6).

Step seven, output each character string S that step six determines to be incremental information to the C file.
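Steps two to five (loading the A file) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the text does not name a Hash algorithm, so the sketch assumes SHA-1 encoded as base32 (which yields 32 characters from a 32-symbol alphabet, so each character selects one of the 32 forks), and it assumes the index is the first D characters of that string. The names `line_hash`, `TextInfo`, `TrieNode`, `find_leaf` and `load_old_lines` are hypothetical.

```python
import base64
import hashlib

FANOUT = 32  # "32 forks" per node, as in step two
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"  # base32 symbols (assumption)


def line_hash(line, h=24):
    """First h base32 characters of the line's SHA-1 digest (hash choice is assumed)."""
    digest = hashlib.sha1(line.encode("utf-8")).digest()
    return base64.b32encode(digest).decode("ascii")[:h]


class TextInfo:
    """Text information node: caHash, caMsg, and a link to the next node."""
    __slots__ = ("ca_hash", "ca_msg", "next")

    def __init__(self, ca_hash, ca_msg, nxt=None):
        self.ca_hash = ca_hash
        self.ca_msg = ca_msg
        self.next = nxt


class TrieNode:
    """One node of the 32-fork tree; only leaves carry a chain of TextInfo nodes."""
    __slots__ = ("children", "chain")

    def __init__(self):
        self.children = [None] * FANOUT
        self.chain = None


def find_leaf(root, index, create):
    """Follow the index one character per level; optionally create missing nodes."""
    node = root
    for ch in index:
        slot = ALPHABET.index(ch)
        child = node.children[slot]
        if child is None:
            if not create:
                return None  # corresponds to the null pointer case
            child = TrieNode()
            node.children[slot] = child
        node = child
    return node


def load_old_lines(lines, d=4, h=24, m=8):
    """Steps three to five: hash each line of A and chain it at its leaf."""
    root = TrieNode()
    for line in lines:
        s_hash = line_hash(line, h)
        leaf = find_leaf(root, s_hash[:d], create=True)
        # Lines that reach the same leaf are kept in a linked list.
        leaf.chain = TextInfo(s_hash, line[:m], leaf.chain)
    return root
```

In practice the lines would come from the A file with trailing newlines stripped, for example `load_old_lines(l.rstrip("\n") for l in open(path, encoding="utf-8"))`. Nodes are allocated on demand here, which is a design choice of the sketch rather than something the text specifies.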

The node query in step six comprises the following steps:

(1) Read a line record of the B file and store it in a character string variable S.

(2) Take the Hash value sHash of S, and use the same D value (4 to 6) as the index to query text information in the Tire Tree.

(3) If the caHash in the text information node is consistent with sHash, and the caMsg in the text information node is consistent with the first M characters of S, the record line S of the B file can be judged to exist in the A file.

(4) If querying the Tire Tree and the text information nodes with sHash yields a null pointer, or the caHash in the text information node is consistent with sHash but the caMsg is not consistent with the first M characters of S, the record line S of the B file can be determined not to exist in the A file, and the line is incremental information.
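The decision rule in sub-steps (3) and (4) can be sketched as a small function. This is an illustration only: the chain of text information nodes found at a leaf is modeled as a plain Python list of `(caHash, caMsg)` pairs, with an empty list standing in for the null pointer, and the function name `is_increment` is hypothetical. The sketch also assumes that, when several nodes are chained at one leaf, the whole chain is scanned before a line is declared incremental.

```python
from typing import List, Tuple


def is_increment(chain: List[Tuple[str, str]], s_hash: str, s: str, m: int) -> bool:
    """Return True if line s of the B file is absent from the A file.

    chain  -- (caHash, caMsg) pairs stored at the leaf reached via the index
    s_hash -- sHash, the Hash value of s
    m      -- number of leading characters of the original text kept in caMsg
    """
    prefix = s[:m]
    for ca_hash, ca_msg in chain:
        # Sub-step (3): both the hash and the stored text prefix must agree.
        if ca_hash == s_hash and ca_msg == prefix:
            return False
    # Sub-step (4): empty chain (null pointer), or no node matched on both fields.
    return True


# Made-up hash strings, for illustration only.
print(is_increment([], "H1", "1001,alice", 4))                # True: nothing stored
print(is_increment([("H1", "1001")], "H1", "1001,alice", 4))  # False: hash and prefix match
print(is_increment([("H1", "9999")], "H1", "1001,alice", 4))  # True: prefix differs
```

The second comparison on caMsg is what guards against two different lines producing the same stored Hash value.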

Compared with the prior art, the method for rapidly acquiring file increment based on memory operation provided by the technical scheme of the invention has the following beneficial effects: the increment operation is fast, about 40 times faster than the existing in-database operation, and the advantage is especially apparent for large data volumes; increment acquisition is accurate and reliable, avoiding failed or duplicated increments; memory consumption is low, with 1 GB of memory theoretically supporting incremental comparison of two files of no more than 38.9 million rows; performance does not drop noticeably when several tasks run in parallel; and the various drawbacks of relying on the database to compute increments in the prior art are overcome.
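The memory figure can be put in per-row terms with a little arithmetic. The sketch below is a back-of-the-envelope check under assumptions that the text does not state: that "1 GB" means 2^30 bytes, that each text information node costs H + M bytes plus one 8-byte link pointer, and that tree-node overhead is ignored. The function names are hypothetical.

```python
def bytes_per_row(memory_bytes, rows):
    """Average memory available per stored row."""
    return memory_bytes / rows


def node_bytes(h, m, pointer_bytes=8):
    """Assumed footprint of one text information node: caHash + caMsg + link."""
    return h + m + pointer_bytes


GIB = 2 ** 30
budget = bytes_per_row(GIB, 38_900_000)  # a little under 28 bytes per row
smallest = node_bytes(16, 4)             # 28 bytes at the low end of H and M
largest = node_bytes(24, 8)              # 40 bytes at the high end of H and M
print(round(budget, 1), smallest, largest)
```

Under these assumptions the stated capacity corresponds to the smallest recommended values of H and M; larger values trade capacity for a lower chance of a false match.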

Detailed Description

The invention is described in detail below with reference to the drawings, which are shown in fig. 1-6.

The data structure of the Tire Tree is shown in fig. 1:

TireTree is shown in fig. 2 (N = 32):
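To make the effect of the fan-out N = 32 and the depth D concrete, the following sketch computes the number of possible leaves for the recommended depths and shows how an index string maps to a path of child slots. It assumes, purely for illustration, that index characters are drawn from the 32-symbol base32 alphabet; the names `leaf_count` and `path_for` are hypothetical.

```python
N = 32  # forks per node
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"  # assumed 32-symbol index alphabet


def leaf_count(depth):
    """Number of distinct leaves reachable in a full N-fork tree of this depth."""
    return N ** depth


def path_for(index):
    """Child slot (0..31) taken at each level for a given index string."""
    return [ALPHABET.index(ch) for ch in index]


for d in (4, 5, 6):
    print(d, leaf_count(d))
# 4 1048576
# 5 33554432
# 6 1073741824

print(path_for("AB7"))  # [0, 1, 31]
```

A larger D spreads rows over more leaves and shortens the chains at each leaf, but a fully populated tree grows by a factor of 32 per level, which is one reason to choose D according to file size and available memory.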

As shown in fig. 3:

As shown in fig. 4:

Step six, calculate the Hash value of each row of the B file with the same algorithm, and query nodes in the Tire Tree using the same D value (4 to 6), wherein the node query comprises sub-steps (1) to (4) set out above.

Step seven, output the character string S to the C file.

The process flow for loading the old file is shown in fig. 5:

The process flow for obtaining the increment is shown in fig. 6:

In this embodiment, the increment method is as follows: characteristic values are extracted from the old file row by row and written into a specific data structure in memory; characteristic values are then extracted from the new file with the same algorithm and looked up in memory. The characteristic-value algorithm and the data structure are designed so as to avoid conflicts. The increment operation is fast, about 40 times faster than the existing in-database operation, and the advantage is especially apparent for large data volumes. Increment acquisition is accurate and reliable, avoiding failed or duplicated increments. Memory consumption is low: 1 GB of memory theoretically supports incremental comparison of two files of no more than 38.9 million rows. Performance does not drop noticeably when several tasks run in parallel, and the various drawbacks of relying on the database to compute increments in the prior art are overcome.

The technical solutions described above represent only the preferred technical solutions of the present invention. Modifications that those skilled in the art may make to parts of these solutions embody the principles of the present invention and fall within its protection scope.

Claims

1. A method for rapidly acquiring file increment based on memory operation is characterized in that,

the incremental method comprises the following steps:

step one, selecting 2 files between which an increment is to be obtained;

step two, constructing in a memory a 32-fork Tire Tree of depth D, wherein the specific depth can be reasonably selected according to the size of a file or a hardware environment;

step three, taking a Hash value of the A file line by line, selecting D characters of the Hash value as a Tire Tree index according to the sizes of the memory and the file, and finding the leaf node pointers of the Tire Tree one by one according to the index;

step four, creating a text information node at a leaf node, and storing the Hash value of each line of text and a part of the original text into the text information node, wherein caHash is used for storing the Hash value, its length H being as long as the character string calculated by the adopted Hash algorithm allows, and caMsg is used for storing a part of the original text, namely its first M characters;

step five, sequentially storing the Hash values and partial original texts of all rows of the whole file into text information nodes, organizing the text information nodes in a linked list form if a conflict occurs, the linked list resolving the conflict and saving part of the memory, and, if the requirement on the operation speed is high and the memory is sufficient, organizing the text information nodes as a binary tree in linked representation, which increases the memory consumption by 10-16%;

step six, calculating the Hash value of each row of the B file according to the same algorithm, and querying nodes in the Tire Tree by using the same D value;

and step seven, outputting the character string S to the C file.

2. The method for rapidly acquiring the file increment based on the memory operation according to claim 1, wherein the node query in step six comprises the steps of:

(1) reading a line record of the B file, and storing it in a character string variable S;

(2) taking a Hash value sHash of S, and using the same D value as an index to query text information in the Tire Tree;

(3) if the caHash in the text information node is consistent with the sHash and the caMsg in the text information node is consistent with the first M characters of S, judging that the record line S of the B file exists in the A file;

(4) if querying the Tire Tree and the text information nodes with the sHash yields a null pointer, or the caHash in the text information node is consistent with the sHash but the caMsg is not consistent with the first M characters of S, determining that the record line S of the B file does not exist in the A file, and that the line is incremental information.