CN108804542A

CN108804542A - A method for quickly obtaining file increments based on in-memory operations

Info

Publication number: CN108804542A
Application number: CN201810465352.2A
Authority: CN
Inventors: 柴磊; 原伟; 柳彦利; 杨峰; 马章焘; 王立强; 冯剑; 付斐; 郭峰; 刘改琴; 李扬; 刘晓霖
Original assignee: HEBEI GODSEND HIGH-TECH Co Ltd
Current assignee: HEBEI GODSEND HIGH-TECH Co Ltd
Priority date: 2018-05-16
Filing date: 2018-05-16
Publication date: 2018-11-13
Anticipated expiration: 2038-05-16
Also published as: CN108804542B

Abstract

The invention discloses a method for quickly obtaining file increments based on in-memory operations. The increment method comprises: extracting a characteristic value from each row of the old file and writing it into a specific data structure in memory, then extracting characteristic values from the new file with the same algorithm and querying them in memory. The method covers the characteristic-value algorithm, the design of the data structure, and conflict avoidance. The invention has the advantages of simple structure and strong practicability.

Description

A method for quickly obtaining file increments based on in-memory operations

Technical field

The present invention relates to the fields G06F17/30, G06F17/00, G06F17 and G06F, and in particular to a method for quickly obtaining file increments based on in-memory operations.

Background art

In an ETL process, obtaining incremental data is a critical step. The conventional method loads the new data into a database and then strips out the increment there. This consumes a large amount of expensive database resources and is slow. In addition, whenever an interface is upgraded, the increment-stripping program in the database has to be modified, which requires considerable manual intervention and puts heavy pressure on development.

Summary of the invention

The purpose of the present invention is to solve the above problems by providing a method for quickly obtaining file increments based on in-memory operations.

The technical solution that achieves the above purpose is a method for quickly obtaining file increments based on in-memory operations. The increment method comprises: extracting a characteristic value from each row of the old file and writing it into a specific data structure in memory, then extracting characteristic values from the new file with the same algorithm and querying them in memory. The method covers the characteristic-value algorithm, the design of the data structure, and conflict avoidance.

The increment method comprises the following steps:

Step 1: choose the 2 files for which the increment is to be obtained. (The old file is hereinafter called file A, the new file is called file B, and the increment file is file C.)

Step 2: build in memory a 32-ary Tire Tree (trie) of depth D. The specific depth can be chosen according to the file size or the hardware environment; the recommended depth is 4~6.

Step 3: take the hash value of file A row by row. According to the available memory and the file size, choose D (4~6) positions of the hash value as the Tire Tree index, and follow the index one position at a time to find the leaf node pointer of the Tire Tree.

Step 4: create a text information node at the leaf node, and store the hash value of each text row together with part of its original text in the text information node. caHash holds the hash value; its size H is preferably the longest string the chosen hashing algorithm can produce, and 16~24 is recommended. caMsg holds part of the original text; it is suggested to take the first M positions of the original text, and the recommended size of M is 4~8.

Step 5: store the hash value and partial original text of every row of the entire file into text information nodes in turn. If a conflict occurs, organize the text information nodes as a linked list; avoiding conflicts with a linked list saves some memory. If high operation speed is required and memory is sufficient, the text information nodes can instead be organized as a binary tree in linked representation, but this increases memory consumption by 10%~16%.

Step 6: compute the hash value of every row of file B with the same algorithm, and use the same D value (4~6) to query nodes in the Tire Tree.

Step 7: output the string S to file C.
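Steps 2 to 5 (loading the old file) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: MD5 stands in for the hash algorithm, which the text does not name; each index position is assumed to be 5 bits of the hash, selecting one of 32 children; and `row_hash`, `index_path`, `new_node` and `load_old_file` are hypothetical names.

```python
import hashlib

FANOUT = 32   # 32-ary tree, as in Step 2
D = 4         # depth; the text recommends 4~6
H = 16        # characters kept in caHash (16~24 recommended)
M = 8         # leading characters kept in caMsg (4~8 recommended)


def row_hash(row):
    """Assumed hash: MD5 hex digest truncated to H characters."""
    return hashlib.md5(row.encode("utf-8")).hexdigest()[:H]


def index_path(hash_hex):
    """Assumed index: the low D base-32 digits (5 bits each) of the hash."""
    value = int(hash_hex, 16)
    return [(value >> (5 * level)) & (FANOUT - 1) for level in range(D)]


def new_node():
    """One tree node: an array of 32 child pointers, all initially null."""
    return [None] * FANOUT


def load_old_file(rows):
    """Steps 3-5: store (caHash, caMsg) for every row of file A."""
    root = new_node()
    for row in rows:
        ca_hash = row_hash(row)
        path = index_path(ca_hash)
        node = root
        for digit in path[:-1]:
            if node[digit] is None:
                node[digit] = new_node()
            node = node[digit]
        # The last index position holds the leaf pointer: the head of a
        # linked list of text information nodes (caHash, caMsg, next).
        leaf = path[-1]
        node[leaf] = (ca_hash, row[:M], node[leaf])
    return root
```

Rows whose hashes share the same D index positions land on the same leaf pointer and are chained there, which is the linked-list option of Step 5.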

The node query of step 6 proceeds as follows:

(1) Read a row record of file B and store it in the string variable S.

(2) Take the hash value sHash of S and, using the same D value (4~6) as the index, query the text information in the Tire Tree.

(3) If caHash in a text information node matches sHash, and caMsg in that node matches the first M positions of S, then the row S of file B is judged to exist in file A and is skipped.

(4) If the query of the Tire Tree and the text information nodes by sHash yields a null pointer, or if caHash in the text information node matches sHash but caMsg does not match the first M positions of S, then the row S of file B is determined not to exist in file A; this row is increment information.
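Sub-steps (1) to (4) can be sketched as below. The sketch makes several simplifying assumptions that are not in the text: MD5 is used as the hash, each index position is 5 bits, the tree is held in nested dicts rather than 32-slot arrays, and a Python list stands in for the chain of text information nodes. The helper names `row_hash`, `leaf_slot`, `load` and `increment` are hypothetical.

```python
import hashlib

D, H, M = 4, 16, 8  # depth, caHash length, caMsg length (within the recommended ranges)


def row_hash(row):
    # Assumed hash function; the text does not name one.
    return hashlib.md5(row.encode("utf-8")).hexdigest()[:H]


def leaf_slot(trie, s_hash, create=False):
    """Follow D index positions; return (parent node, last key), or None if the path is missing."""
    value = int(s_hash, 16)
    digits = [(value >> (5 * i)) & 31 for i in range(D)]
    node = trie
    for digit in digits[:-1]:
        if digit not in node:
            if not create:
                return None
            node[digit] = {}
        node = node[digit]
    return node, digits[-1]


def load(trie, rows):
    """Load file A: one (caHash, caMsg) entry per row."""
    for row in rows:
        ca_hash = row_hash(row)
        node, key = leaf_slot(trie, ca_hash, create=True)
        node.setdefault(key, []).append((ca_hash, row[:M]))


def increment(trie, new_rows):
    """Yield the rows of file B that are not found in file A."""
    for s in new_rows:                        # (1) read a row into S
        s_hash = row_hash(s)                  # (2) sHash, same algorithm and same D
        slot = leaf_slot(trie, s_hash)
        chain = slot[0].get(slot[1], []) if slot else []    # empty chain = null pointer
        exists = any(ca_hash == s_hash and ca_msg == s[:M]  # (3) both fields must match
                     for ca_hash, ca_msg in chain)
        if not exists:                        # (4) the row is increment information
            yield s


trie = {}
load(trie, ["1,apple", "2,banana"])
print(list(increment(trie, ["1,apple", "2,banana", "3,cherry"])))  # prints ['3,cherry']
```

Comparing caMsg in addition to caHash is what guards against two different rows that happen to share a hash value.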

Compared with the prior art, a method for quickly obtaining file increments based on in-memory operations made using the technical solution of the present invention has the following beneficial effects. Increment computation is fast, about 40 times faster than the existing computation in a database, and the gain is especially pronounced when obtaining increments of large data volumes. The method is accurate and reliable, with no failed or repeated increment acquisition. Memory consumption is low: in theory, 1G of memory supports an increment comparison of two files of no more than 38.9 million row records each. Running multiple tasks in parallel does not noticeably reduce performance. The method resolves the various drawbacks of relying on a database to compute increments in the prior art.
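As a rough check on the memory figure, assume that "1G" means 2^30 bytes and that the whole budget is divided evenly across the rows held in memory. Both are assumptions; the text does not say how the figure was derived.

```latex
\frac{2^{30}\ \text{bytes}}{38.9 \times 10^{6}\ \text{rows}} \approx 27.6\ \text{bytes per row}
```

For comparison, a text information node with H = 16 and M = 8 carries 24 bytes of payload before any link pointer, so the stated capacity is of the same order as one small node per row.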

Detailed description

The present invention is described in detail below with reference to the accompanying drawings, as shown in Figures 1 to 6.

Step 1: choose the 2 files for which the increment is to be obtained. (The old file is hereinafter called file A, the new file is called file B, and the increment file is file C.)

The data structure of the Tire Tree is shown in Figure 1:

Step 3: take the hash value of file A row by row. According to the available memory and the file size, choose D (4~6) positions of the hash value as the Tire Tree index, and follow the index one position at a time to find the leaf node pointer of the Tire Tree.

The Tire Tree is shown in Figure 2 (N=32):
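To make the effect of D concrete, the sketch below derives the index positions from a hash and counts the leaf pointers that each recommended depth addresses. It assumes each position is 5 bits of the hash (one of N = 32 children), which is one possible reading of the text; `index_digits` is a hypothetical helper.

```python
FANOUT = 32  # N = 32 in Figure 2


def index_digits(hash_hex, depth):
    """Split the low bits of a hex hash into `depth` base-32 digits (assumed scheme)."""
    value = int(hash_hex, 16)
    return [(value >> (5 * level)) & (FANOUT - 1) for level in range(depth)]


for depth in (4, 5, 6):
    # Number of distinct leaf pointers addressed by `depth` index positions:
    # 1048576, 33554432 and 1073741824 respectively.
    print(depth, FANOUT ** depth)

print(index_digits("ff", 4))  # prints [31, 7, 0, 0]
```

A larger D spreads rows over more leaf pointers and shortens the conflict chains, at the cost of more tree nodes, which is why the text ties the choice to file size and available memory.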

Step 4: create a text information node at the leaf node, and store the hash value of each text row together with part of its original text in the text information node. caHash holds the hash value; its size H is preferably the longest string the chosen hashing algorithm can produce, and 16~24 is recommended. caMsg holds part of the original text; it is suggested to take the first M positions of the original text, and the recommended size of M is 4~8.

The text information node is shown in Figure 3:
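A fixed-width layout for the two fields of the text information node might look like the following sketch. It assumes that H and M are counted in bytes and picks H = 16 and M = 8 from the recommended ranges; `NODE_LAYOUT`, `pack_node` and `unpack_node` are hypothetical names, and the link pointer used for conflict handling is left out.

```python
import struct

H = 16  # caHash size in bytes (assumed unit)
M = 8   # caMsg size in bytes (assumed unit)

# Fixed-width caHash followed by fixed-width caMsg, with no padding between them.
NODE_LAYOUT = struct.Struct(f"={H}s{M}s")


def pack_node(ca_hash, row):
    """Pack the hash and the first M bytes of the row; short fields are zero-padded."""
    return NODE_LAYOUT.pack(ca_hash.encode("ascii")[:H], row.encode("utf-8")[:M])


def unpack_node(blob):
    """Return (caHash, caMsg) as bytes with the zero padding removed."""
    ca_hash, ca_msg = NODE_LAYOUT.unpack(blob)
    return ca_hash.rstrip(b"\0"), ca_msg.rstrip(b"\0")


blob = pack_node("0123456789abcdef", "1,apple,red")
print(NODE_LAYOUT.size)   # prints 24
print(unpack_node(blob))  # prints (b'0123456789abcdef', b'1,apple,')
```

Keeping only a short prefix of the row keeps the node small while still giving a second check when two rows share a hash value.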

Step 5: store the hash value and partial original text of every row of the entire file into text information nodes in turn. If a conflict occurs, organize the text information nodes as a linked list; avoiding conflicts with a linked list saves some memory. If high operation speed is required and memory is sufficient, the text information nodes can instead be organized as a binary tree in linked representation, but this increases memory consumption by 10%~16%.

The organization of the text information nodes is shown in Figure 4:
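The two ways of organizing conflicting text information nodes can be contrasted with a small sketch. The text only says "binary tree in linked representation"; ordering the tree by the pair (caHash, caMsg), as a binary search tree, is an assumption made here, and all class and function names are hypothetical.

```python
class ChainNode:
    """Text information node in a linked list: one link per node (less memory)."""
    __slots__ = ("ca_hash", "ca_msg", "next")

    def __init__(self, ca_hash, ca_msg, next=None):
        self.ca_hash, self.ca_msg, self.next = ca_hash, ca_msg, next


def chain_contains(head, ca_hash, ca_msg):
    """Linear scan of the chain."""
    node = head
    while node is not None:
        if node.ca_hash == ca_hash and node.ca_msg == ca_msg:
            return True
        node = node.next
    return False


class TreeNode:
    """Text information node in a binary tree: two links per node (faster lookup)."""
    __slots__ = ("ca_hash", "ca_msg", "left", "right")

    def __init__(self, ca_hash, ca_msg):
        self.ca_hash, self.ca_msg = ca_hash, ca_msg
        self.left = self.right = None


def tree_insert(root, ca_hash, ca_msg):
    """Insert in (ca_hash, ca_msg) order and return the root; duplicates are ignored."""
    if root is None:
        return TreeNode(ca_hash, ca_msg)
    key, node = (ca_hash, ca_msg), root
    while True:
        here = (node.ca_hash, node.ca_msg)
        if key == here:
            return root
        side = "left" if key < here else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, TreeNode(ca_hash, ca_msg))
            return root
        node = child


def tree_contains(root, ca_hash, ca_msg):
    """Descend the tree, halving the candidates at each step when it is balanced."""
    key, node = (ca_hash, ca_msg), root
    while node is not None:
        here = (node.ca_hash, node.ca_msg)
        if key == here:
            return True
        node = node.left if key < here else node.right
    return False
```

The extra child link per node is the kind of overhead the text refers to when it says the binary tree form costs additional memory.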

Step 6: compute the hash value of every row of file B with the same algorithm, and use the same D value (4~6) to query nodes in the Tire Tree. The node query proceeds as follows:

(1) Read a row record of file B and store it in the string variable S.

Step 7: output the string S to file C.

The process flow for loading the old file is shown in Figure 5:

The process flow for obtaining the increment is shown in Figure 6:

The present embodiment is characterized in that the increment method comprises: extracting a characteristic value from each row of the old file and writing it into a specific data structure in memory, then extracting characteristic values from the new file with the same algorithm and querying them in memory. The method covers the characteristic-value algorithm, the design of the data structure, and conflict avoidance. Increment computation is fast, about 40 times faster than the existing computation in a database, and the gain is especially pronounced when obtaining increments of large data volumes. The method is accurate and reliable, with no failed or repeated increment acquisition. Memory consumption is low: in theory, 1G of memory supports an increment comparison of two files of no more than 38.9 million row records each. Running multiple tasks in parallel does not noticeably reduce performance, and the method resolves the various drawbacks of relying on a database to compute increments in the prior art.

The above technical solution merely embodies a preferred technical solution of the present invention. Variations that those skilled in the art may make to some parts of it all embody the principle of the present invention and fall within the scope of protection of the present invention.

Claims

1. A method for quickly obtaining file increments based on in-memory operations, characterized in that the increment method comprises: extracting a characteristic value from each row of the old file and writing it into a specific data structure in memory, then extracting characteristic values from the new file with the same algorithm and querying them in memory; the method covers the characteristic-value algorithm, the design of the data structure, and conflict avoidance.

2. The method for quickly obtaining file increments based on in-memory operations according to claim 1, characterized in that the increment method comprises the following steps:

Step 1: choose the 2 files for which the increment is to be obtained (the old file is hereinafter called file A, the new file is called file B, and the increment file is file C);

Step 2: build in memory a 32-ary Tire Tree (trie) of depth D, where the specific depth can be chosen according to the file size or the hardware environment, and the recommended depth is 4~6;

Step 3: take the hash value of file A row by row, choose D (4~6) positions of the hash value as the Tire Tree index according to the available memory and the file size, and follow the index one position at a time to find the leaf node pointer of the Tire Tree;

Step 4: create a text information node at the leaf node, and store the hash value of each text row together with part of its original text in the text information node, wherein caHash holds the hash value, its size H is preferably the longest string the chosen hashing algorithm can produce, and 16~24 is recommended; caMsg holds part of the original text, it is suggested to take the first M positions of the original text, and the recommended size of M is 4~8;

Step 5: store the hash value and partial original text of every row of the entire file into text information nodes in turn; if a conflict occurs, organize the text information nodes as a linked list, since avoiding conflicts with a linked list saves some memory; if high operation speed is required and memory is sufficient, the text information nodes can instead be organized as a binary tree in linked representation, but this increases memory consumption by 10%~16%;

Step 6: compute the hash value of every row of file B with the same algorithm, and use the same D value (4~6) to query nodes in the Tire Tree;

Step 7: output the string S to file C.

3. The method for quickly obtaining file increments based on in-memory operations according to claim 2, characterized in that the node query of step 6 comprises:

(1) Read a row record of file B and store it in the string variable S;

(2) Take the hash value sHash of S and, using the same D value (4~6) as the index, query the text information in the Tire Tree;

(3) If caHash in a text information node matches sHash, and caMsg in that node matches the first M positions of S, then the row S of file B is judged to exist in file A and is skipped;

(4) If the query of the Tire Tree and the text information nodes by sHash yields a null pointer, or if caHash in the text information node matches sHash but caMsg does not match the first M positions of S, then the row S of file B is determined not to exist in file A; this row is increment information.