CN108804542A - Method for rapidly acquiring file increment based on memory operation - Google Patents

Method for rapidly acquiring file increment based on memory operation

Info

Publication number
CN108804542A
Authority
CN
China
Prior art keywords
file
memory
files
text message
node
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810465352.2A
Other languages
Chinese (zh)
Other versions
CN108804542B (en)
Inventor
柴磊
原伟
柳彦利
杨峰
马章焘
王立强
冯剑
付斐
郭峰
刘改琴
李扬
刘晓霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
HEBEI GODSEND HIGH-TECH Co Ltd
Original Assignee
HEBEI GODSEND HIGH-TECH Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2018-05-16
Filing date
2018-05-16
Publication date
2018-11-13
Application filed by HEBEI GODSEND HIGH-TECH Co Ltd
Priority to CN201810465352.2A
Publication of CN108804542A
Application granted
Publication of CN108804542B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for rapidly acquiring a file increment based on memory operation. The method comprises: extracting a feature value from the old file line by line and writing it into a specific data structure in memory, then extracting feature values from the new file with the same algorithm and looking them up in memory. The method covers the feature-value algorithm, the design of the data structure, and collision avoidance. The invention has the advantages of a simple structure and strong practicality.

Description

Method for rapidly acquiring file increment based on memory operation
Technical field
The present invention relates to the fields of G06F17/30, G06F17/00, G06F17 and G06F, and in particular to a method for rapidly acquiring a file increment based on memory operation.
Background
In an ETL process, obtaining incremental data is a critical step. The conventional approach loads the new data into a database and then separates the increment there. This consumes a large amount of expensive database resources and runs slowly. In addition, whenever an interface is upgraded, the increment-separation program in the database must be modified, which requires considerable manual intervention and puts heavy pressure on development.
Summary of the invention
The purpose of the present invention is to solve the above problems by providing a method for rapidly acquiring a file increment based on memory operation.
The technical solution that achieves this purpose is a method for rapidly acquiring a file increment based on memory operation. The method comprises: extracting a feature value from the old file line by line and writing it into a specific data structure in memory, then extracting feature values from the new file with the same algorithm and looking them up in memory. The method covers the feature-value algorithm, the design of the data structure, and collision avoidance.
The increment method includes the following steps:
Step 1: choose the two files between which the increment is to be obtained. (The old file is hereinafter called file A, the new file is called file B, and the increment file is called file C.)
Step 2: build a 32-ary trie of depth D in memory. The depth can be chosen to suit the file size or the hardware environment; the recommended depth is 4 to 6.
Step 3: compute a hash value for each line of file A, and take D positions (4 to 6) of the hash value, chosen according to the available memory and the file size, as the trie index. Follow the index position by position to find the leaf node pointer of the trie.
Step 4: create a text information node at the leaf node, and store the hash value of the line and part of its original text in that node. The field caHash holds the hash value; its size H should preferably be as long as the string produced by the hashing algorithm in use, and 16 to 24 is recommended. The field caMsg holds part of the original text; taking the first M positions of the original text is suggested, with M recommended to be 4 to 8.
Step 5: store the hash value and partial original text of every line of the file into text information nodes in turn. If a collision occurs, organize the text information nodes as a linked list. Resolving collisions with a linked list saves some memory. If high speed is required and memory is plentiful, the text information nodes can instead be organized as a binary tree in linked representation, at the cost of 10% to 16% more memory.
Step 6: compute the hash value of each line of file B with the same algorithm, and query the trie using the same D value (4 to 6).
Step 7: output each string S identified as increment information to file C.
In step 6, the node is queried as follows:
(1) Read a line record from file B and store it in the string variable S.
(2) Compute the hash value sHash of S and, using the same D value (4 to 6) as the index, query the text information in the trie.
(3) If caHash in a text information node equals sHash, and caMsg in that node equals the first M positions of S, the record line S of file B is judged to exist in file A.
(4) If the lookup by sHash in the trie and the text information nodes yields a null pointer, or caHash equals sHash but caMsg differs from the first M positions of S, the record line S of file B is determined not to exist in file A, and this line is increment information.
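The steps above can be sketched end to end in Python. This is a minimal illustration under stated assumptions, not the patented implementation: the text does not name a hashing algorithm or an encoding, so MD5 and Base32 (whose 32-symbol alphabet maps naturally onto 32 branches) are assumptions, and the names `line_hash`, `load_old`, `exists_in_old` and `increment` are hypothetical.

```python
import base64
import hashlib

D = 4   # trie depth; the text recommends 4 to 6
M = 8   # prefix length kept in caMsg; the text recommends 4 to 8
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"  # 32 symbols, one per branch


def line_hash(line: str) -> str:
    """Base32-encoded MD5 of a line (MD5 and Base32 are assumptions)."""
    digest = hashlib.md5(line.encode("utf-8")).digest()
    return base64.b32encode(digest).decode("ascii").rstrip("=")


def new_node() -> list:
    """An inner trie node: 32 child slots, all empty."""
    return [None] * 32


def load_old(lines, root) -> None:
    """Steps 3 to 5: index every line of file A."""
    for line in lines:
        h = line_hash(line)
        node = root
        for ch in h[:D - 1]:                 # walk or create the inner levels
            i = ALPHABET.index(ch)
            if node[i] is None:
                node[i] = new_node()
            node = node[i]
        i = ALPHABET.index(h[D - 1])
        if node[i] is None:
            node[i] = []                     # leaf slot: chain of records
        node[i].append((h, line[:M]))        # (caHash, caMsg)


def exists_in_old(line, root) -> bool:
    """Step 6: same hash, same D, then compare caHash and caMsg."""
    h = line_hash(line)
    node = root
    for ch in h[:D - 1]:
        node = node[ALPHABET.index(ch)]
        if node is None:                     # null pointer: not in file A
            return False
    chain = node[ALPHABET.index(h[D - 1])]
    if chain is None:
        return False
    return any(ca_hash == h and ca_msg == line[:M] for ca_hash, ca_msg in chain)


def increment(old_lines, new_lines) -> list:
    """Step 7: the lines of file B that are absent from file A."""
    root = new_node()
    load_old(old_lines, root)
    return [s for s in new_lines if not exists_in_old(s, root)]
```

For example, `increment(["a", "b", "c"], ["b", "d", "a", "e"])` keeps only the lines of the second list that never occur in the first. A Python list stands in for the collision chain here purely for brevity.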
Compared with the prior art, the method for rapidly acquiring a file increment based on memory operation produced with the technical solution of the present invention has the following beneficial effects. The increment operation is fast, about 40 times faster than the existing in-database operation, and the gain is especially pronounced when acquiring increments from large data volumes. It is accurate and reliable, with no failed or duplicated increment acquisition. Memory consumption is low: in theory, 1 GB of memory supports an incremental comparison of two files of no more than 38,900,000 line records. Performance does not drop noticeably when multiple tasks run in parallel. The method resolves the various drawbacks of relying on a database to compute increments in the prior art.
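The memory figure can be turned into a per-line budget with simple arithmetic. The sketch below only divides the quoted memory size by the quoted line count; the text does not say how the figure was derived or whether "1G" is binary or decimal, so both readings are computed, and the name `bytes_per_row` is hypothetical.

```python
ROWS = 38_900_000   # line records quoted in the text
GIB = 2 ** 30       # assumption: "1G" read as 1 GiB
GB = 10 ** 9        # alternative reading: 1 GB decimal


def bytes_per_row(total_bytes: int, rows: int = ROWS) -> float:
    """Average memory available per line record."""
    return total_bytes / rows


print(round(bytes_per_row(GIB), 2))
print(round(bytes_per_row(GB), 2))
```

Under either reading the budget is roughly 26 to 28 bytes per line record, which is the same order of magnitude as a caHash field plus a short caMsg field and a link.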
Detailed description
The present invention is described in detail below with reference to the accompanying drawings, as shown in Figures 1 to 6.
Step 1: choose the two files between which the increment is to be obtained. (The old file is hereinafter called file A, the new file is called file B, and the increment file is called file C.)
Step 2: build a 32-ary trie of depth D in memory. The depth can be chosen to suit the file size or the hardware environment; the recommended depth is 4 to 6.
The data structure of the trie is shown in Figure 1:
Step 3: compute a hash value for each line of file A, and take D positions (4 to 6) of the hash value, chosen according to the available memory and the file size, as the trie index. Follow the index position by position to find the leaf node pointer of the trie.
The trie is shown in Figure 2 (N=32):
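How D positions of a hash select a path through a 32-ary trie, and how many leaf positions each recommended depth yields, can be illustrated as follows. This is a sketch under assumptions: MD5 and Base32 are stand-ins for the unnamed hashing algorithm, and `index_path` and `leaf_slots` are hypothetical helper names.

```python
import base64
import hashlib

BRANCHES = 32
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"  # Base32 alphabet, 32 symbols


def index_path(line: str, depth: int) -> list:
    """Child indices (0..31) to follow from the root for one line."""
    digest = hashlib.md5(line.encode("utf-8")).digest()   # assumed hash
    text = base64.b32encode(digest).decode("ascii")       # 5 bits per symbol
    return [ALPHABET.index(ch) for ch in text[:depth]]


def leaf_slots(depth: int) -> int:
    """Number of distinct leaf positions in a full 32-ary trie of this depth."""
    return BRANCHES ** depth


for d in (4, 5, 6):
    print(d, leaf_slots(d))   # 1048576, 33554432, 1073741824
```

Each extra level multiplies the number of leaf positions by 32, which shortens collision chains but enlarges the pointer arrays, so the choice of D trades memory against lookup speed.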
Step 4: create a text information node at the leaf node, and store the hash value of the line and part of its original text in that node. The field caHash holds the hash value; its size H should preferably be as long as the string produced by the hashing algorithm in use, and 16 to 24 is recommended. The field caMsg holds part of the original text; taking the first M positions of the original text is suggested, with M recommended to be 4 to 8.
As shown in Figure 3:
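A fixed-size record with the two fields named in the text can be modelled with Python's `struct` module. This is an illustrative layout only: the 8-byte link field and the packed, little-endian format are assumptions, and `pack_node` and `unpack_node` are hypothetical names.

```python
import struct

H = 16   # caHash size; the text recommends 16 to 24
M = 4    # caMsg size; the text recommends 4 to 8

# Packed layout without padding: caHash, caMsg, then an 8-byte link to the
# next node in the chain (the link field is an assumption for illustration).
NODE_FORMAT = f"<{H}s{M}sQ"
NODE_SIZE = struct.calcsize(NODE_FORMAT)


def pack_node(ca_hash: bytes, ca_msg: bytes, next_offset: int = 0) -> bytes:
    # 's' fields are truncated or zero-padded to their fixed width by struct.
    return struct.pack(NODE_FORMAT, ca_hash, ca_msg, next_offset)


def unpack_node(raw: bytes):
    ca_hash, ca_msg, next_offset = struct.unpack(NODE_FORMAT, raw)
    return ca_hash, ca_msg.rstrip(b"\x00"), next_offset
```

With H = 16 and M = 4 this layout occupies 28 bytes per node. Larger H or M lowers the chance of a false match at the cost of more memory per line.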
Step 5: store the hash value and partial original text of every line of the file into text information nodes in turn. If a collision occurs, organize the text information nodes as a linked list. Resolving collisions with a linked list saves some memory. If high speed is required and memory is plentiful, the text information nodes can instead be organized as a binary tree in linked representation, at the cost of 10% to 16% more memory.
As shown in Figure 4:
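The linked-list organization of colliding text information nodes might look like the sketch below. The field names caHash and caMsg come from the text; the `next` link, the prepend-on-insert policy, and the function names `chain_insert` and `chain_contains` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextInfoNode:
    caHash: str                              # hash value of the line
    caMsg: str                               # first M positions of the line
    next: Optional["TextInfoNode"] = None    # next node sharing this leaf


def chain_insert(head: Optional[TextInfoNode], ca_hash: str, ca_msg: str) -> TextInfoNode:
    """Prepend a record to the chain at one leaf and return the new head."""
    return TextInfoNode(ca_hash, ca_msg, head)


def chain_contains(head: Optional[TextInfoNode], s_hash: str, s_prefix: str) -> bool:
    """Walk the chain; a match needs both caHash and caMsg to agree."""
    node = head
    while node is not None:
        if node.caHash == s_hash and node.caMsg == s_prefix:
            return True
        node = node.next
    return False
```

Prepending keeps insertion constant-time. Lookup cost grows with chain length, which is why the text offers a binary tree as the faster but more memory-hungry alternative.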
Step 6: compute the hash value of each line of file B with the same algorithm, and query the trie using the same D value (4 to 6). The node is queried as follows:
(1) Read a line record from file B and store it in the string variable S.
(2) Compute the hash value sHash of S and, using the same D value (4 to 6) as the index, query the text information in the trie.
(3) If caHash in a text information node equals sHash, and caMsg in that node equals the first M positions of S, the record line S of file B is judged to exist in file A.
(4) If the lookup by sHash in the trie and the text information nodes yields a null pointer, or caHash equals sHash but caMsg differs from the first M positions of S, the record line S of file B is determined not to exist in file A, and this line is increment information.
Step 7: output each string S identified as increment information to file C.
The process flow for loading the old file is shown in Figure 5:
The process flow for obtaining the increment is shown in Figure 6:
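The two flows (load the old file, then stream the new file and write the increment) can be sketched at file level as follows. To keep the example short, a plain dict keyed by the first D hash characters stands in for the 32-ary trie; that substitution, the use of MD5, and the names `features`, `load_old_file` and `write_increment` are all assumptions rather than details from the source.

```python
import hashlib
from collections import defaultdict

D, M = 4, 8   # index length and caMsg prefix length, within the recommended ranges


def features(line: str):
    """Return (index key, full hash, first M characters) for one line."""
    s = line.rstrip("\n")
    h = hashlib.md5(s.encode("utf-8")).hexdigest()   # assumed hash
    return h[:D], h, s[:M]


def load_old_file(path_a: str) -> dict:
    """Loading flow: read file A once and index every line."""
    index = defaultdict(list)
    with open(path_a, encoding="utf-8") as fa:
        for line in fa:
            key, ca_hash, ca_msg = features(line)
            index[key].append((ca_hash, ca_msg))
    return index


def write_increment(index: dict, path_b: str, path_c: str) -> int:
    """Increment flow: stream file B, write unmatched lines to file C."""
    written = 0
    with open(path_b, encoding="utf-8") as fb, \
            open(path_c, "w", encoding="utf-8") as fc:
        for line in fb:
            key, s_hash, prefix = features(line)
            if (s_hash, prefix) not in index.get(key, ()):
                fc.write(line.rstrip("\n") + "\n")
                written += 1
    return written
```

Only file A is held in memory, and only as hashes and short prefixes. File B is read once and never stored, so its size does not affect memory use.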
The present embodiment is characterized in that the increment method comprises: extracting a feature value from the old file line by line and writing it into a specific data structure in memory, then extracting feature values from the new file with the same algorithm and looking them up in memory. The method covers the feature-value algorithm, the design of the data structure, and collision avoidance. The increment operation is fast, about 40 times faster than the existing in-database operation, and the gain is especially pronounced when acquiring increments from large data volumes. It is accurate and reliable, with no failed or duplicated increment acquisition. Memory consumption is low: in theory, 1 GB of memory supports an incremental comparison of two files of no more than 38,900,000 line records. Performance does not drop noticeably when multiple tasks run in parallel, and the method resolves the various drawbacks of relying on a database to compute increments in the prior art.
The above technical solution embodies only the preferred technical solution of the present invention. Any variations that those skilled in the art may make to parts of it embody the principle of the present invention and fall within its scope of protection.

Claims (3)

1. A method for rapidly acquiring a file increment based on memory operation, characterized in that the increment method comprises: extracting a feature value from the old file line by line and writing it into a specific data structure in memory, then extracting feature values from the new file with the same algorithm and looking them up in memory, the method covering the feature-value algorithm, the design of the data structure, and collision avoidance.
2. The method for rapidly acquiring a file increment based on memory operation according to claim 1, characterized in that the increment method includes the following steps:
Step 1: choose the two files between which the increment is to be obtained (the old file is hereinafter called file A, the new file is called file B, and the increment file is called file C);
Step 2: build a 32-ary trie of depth D in memory, where the depth can be chosen to suit the file size or the hardware environment and the recommended depth is 4 to 6;
Step 3: compute a hash value for each line of file A, take D positions (4 to 6) of the hash value, chosen according to the available memory and the file size, as the trie index, and follow the index position by position to find the leaf node pointer of the trie;
Step 4: create a text information node at the leaf node, and store the hash value of the line and part of its original text in that node, where caHash holds the hash value, its size H is preferably as long as the string produced by the hashing algorithm in use, with 16 to 24 recommended, and caMsg holds part of the original text, suggested to be the first M positions of the original text, with M recommended to be 4 to 8;
Step 5: store the hash value and partial original text of every line of the file into text information nodes in turn; if a collision occurs, organize the text information nodes as a linked list, which resolves collisions and saves some memory; if high speed is required and memory is plentiful, the text information nodes can instead be organized as a binary tree in linked representation, at the cost of 10% to 16% more memory;
Step 6: compute the hash value of each line of file B with the same algorithm, and query the trie using the same D value (4 to 6);
Step 7: output each string S identified as increment information to file C.
3. The method for rapidly acquiring a file increment based on memory operation according to claim 2, characterized in that, in step 6, the node is queried as follows:
(1) read a line record from file B and store it in the string variable S;
(2) compute the hash value sHash of S and, using the same D value (4 to 6) as the index, query the text information in the trie;
(3) if caHash in a text information node equals sHash, and caMsg in that node equals the first M positions of S, the record line S of file B is judged to exist in file A;
(4) if the lookup by sHash in the trie and the text information nodes yields a null pointer, or caHash equals sHash but caMsg differs from the first M positions of S, the record line S of file B is determined not to exist in file A, and this line is increment information.
CN201810465352.2A 2018-05-16 2018-05-16 Method for rapidly acquiring file increment based on memory operation Active CN108804542B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810465352.2A CN108804542B (en) 2018-05-16 2018-05-16 Method for rapidly acquiring file increment based on memory operation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810465352.2A CN108804542B (en) 2018-05-16 2018-05-16 Method for rapidly acquiring file increment based on memory operation

Publications (2)

Publication Number Publication Date
CN108804542A true CN108804542A (en) 2018-11-13
CN108804542B CN108804542B (en) 2021-12-07

Family

ID=64092418

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810465352.2A Active CN108804542B (en) 2018-05-16 2018-05-16 Method for rapidly acquiring file increment based on memory operation

Country Status (1)

Country Link
CN (1) CN108804542B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101911060A (en) * 2007-12-28 2010-12-08 新叶股份有限公司 Database index key update method and program
CN101271466A (en) * 2008-04-30 2008-09-24 中山大学 Electronic dictionary work retrieval method based on self-adapting dictionary tree
US8176018B1 (en) * 2008-04-30 2012-05-08 Netapp, Inc. Incremental file system differencing
CN102084363A (en) * 2008-07-03 2011-06-01 加利福尼亚大学董事会 A method for efficiently supporting interactive, fuzzy search on structured data
CN101482839A (en) * 2009-02-26 2009-07-15 北京世纪互联宽带数据中心有限公司 Electronic document increment memory processing method
CN101807207A (en) * 2010-03-22 2010-08-18 北京大用科技有限责任公司 Method for sharing document based on content difference comparison
CN102024020A (en) * 2010-11-04 2011-04-20 曙光信息产业(北京)有限公司 Efficient metadata memory access method in distributed file system
CN103714134A (en) * 2013-12-18 2014-04-09 中国科学院计算技术研究所 Network flow data index method and system
US20170357524A1 (en) * 2015-01-30 2017-12-14 AppDynamics, Inc. Automated software configuration management

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
高翔: ""中文短文本去重方法研究"", 《计算机工程与应用》 *

Also Published As

Publication number Publication date
CN108804542B (en) 2021-12-07

Similar Documents

Publication Publication Date Title
CN103593436B (en) file merging method and device
CN108255647B (en) High-speed data backup method under samba server cluster
CN103649946B (en) A kind of method and its system for making file system change synchronous
CN103984753B (en) A kind of web crawlers goes the extracting method and device of multiplex eigenvalue
CN109165202A (en) A kind of preprocess method of multi-source heterogeneous big data
CN103150260B (en) Data de-duplication method and device
CA2367181A1 (en) Method for extracting information from a database
CN105989129A (en) Real-time data statistic method and device
CN107506260A (en) A kind of dynamic division database incremental backup method
CN101996250A (en) Hadoop-based mass stream data storage and query method and system
CN105868286A (en) Parallel adding method and system for merging small files on basis of distributed file system
CN103761236A (en) Incremental frequent pattern increase data mining method
CN104809182A (en) Method for web crawler URL (uniform resource locator) deduplicating based on DSBF (dynamic splitting Bloom Filter)
CN109471905A (en) A kind of block chain index method for supporting time range and range of attributes compound query
US20170249218A1 (en) Data to be backed up in a backup system
CN104572679A (en) Public opinion data storage method and device
CN106557777A (en) It is a kind of to be based on the improved Kmeans clustering methods of SimHash
CN104951403B (en) A kind of cold and hot data identification method of low overhead and zero defect
CN102567313B (en) Progressive webpage library deduplication system and its implementation
CN104778164A (en) Method and device for detecting repeated URL (Uniform Resource Locator)
CN104408128B (en) A kind of reading optimization method indexed based on B+ trees asynchronous refresh
CN107423321B (en) Method and device suitable for cloud storage of large-batch small files
CN108182198A (en) Store the control device and read method of Dynamic matrix control device operation data
CN103984723A (en) Method used for updating data mining for frequent item by incremental data
CN108804542A (en) A kind of quick obtaining file increment method based on memory operation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant