CN108804542A - Method for rapidly acquiring file increment based on memory operation - Google Patents
- Publication number
- CN108804542A (application CN201810465352.2A)
- Authority
- CN
- China
- Prior art keywords
- file
- memory
- files
- text message
- node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for rapidly acquiring file increments based on memory operations. The method comprises: extracting a feature value from each row of the old file and writing it into a specific data structure in memory, then extracting feature values from the new file with the same algorithm and looking them up in memory. The method covers the feature-value algorithm, the design of the data structure, and collision avoidance. The invention has the advantages of a simple structure and strong practicability.
Description
Technical field
The present invention relates to the fields of G06F17/30, G06F17/00, G06F17 and G06F, and in particular to a method for rapidly acquiring file increments based on memory operations.
Background art
In an ETL process, obtaining incremental data is a critical step. The conventional approach loads the new data into a database and then extracts the increment there. This consumes a large amount of expensive database resources and runs slowly. Moreover, whenever an interface is upgraded, the increment-extraction program in the database must be modified, which requires considerable manual intervention and places heavy pressure on development.
Summary of the invention
The purpose of the present invention is to solve the above problems by providing a method for rapidly acquiring file increments based on memory operations.
The technical solution that achieves this purpose is a method for rapidly acquiring file increments based on memory operations. The method comprises: extracting a feature value from each row of the old file and writing it into a specific data structure in memory, then extracting feature values from the new file with the same algorithm and looking them up in memory. The method covers the feature-value algorithm, the design of the data structure, and collision avoidance.
The increment method comprises the following steps:
Step 1: choose the two files from which the increment is to be obtained. (The old file is hereinafter called file A, the new file is called file B, and the increment file is called file C.)
Step 2: build a 32-way Trie tree of depth D in memory. The specific depth can be chosen sensibly according to file size or hardware environment; a depth of 4 to 6 is recommended.
Step 3: compute a hash value for each row of file A. According to memory and file size, choose D (4 to 6) positions of the hash value as the Trie tree index, and follow the index one level at a time to find the leaf node pointer of the Trie tree.
Step 4: create a text information node at the leaf node, and store the hash value of the text row and part of its original text in that node. The field caHash stores the hash value; its size H is preferably as long a string as the chosen hashing algorithm produces, with 16 to 24 characters recommended. The field caMsg stores part of the original text; it is suggested to take the first M characters of the original text, with M recommended to be 4 to 8.
Step 5: store the hash value and partial original text of every row of the whole file into text information nodes in turn. If a collision occurs, organize the colliding text information nodes as a linked list. Resolving collisions with a linked list saves some memory. If high operation speed is required and memory is sufficient, the text information nodes can instead be organized as a binary tree in linked representation, at the cost of 10% to 16% more memory.
Step 6: compute the hash value of each row of file B with the same algorithm, and use the same D value (4 to 6) to query nodes in the Trie tree.
Step 7: output each string S identified as an increment to file C.
The node query in step 6 proceeds as follows:
(1) Read one row record of file B and store it in the string variable S.
(2) Compute the hash value sHash of S and, using the same D value (4 to 6) as the index, query the text information in the Trie tree.
(3) If caHash in a text information node equals sHash, and caMsg in that node equals the first M characters of S, then record row S of file B is judged to exist in file A and is skipped.
(4) If the lookup by sHash in the Trie tree and the text information nodes yields a null pointer, or if caHash equals sHash but caMsg differs from the first M characters of S, then record row S of file B is judged not to exist in file A; this row is increment information.
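The decision rule in sub-steps (3) and (4) can be sketched as a small Python function. This is only an illustration under stated assumptions: the function name `is_increment` and the `candidates` argument (the `(caHash, caMsg)` pairs found at the leaf for `sHash`, empty when the lookup yields a null pointer) are hypothetical and do not come from the patent text.

```python
def is_increment(s: str, s_hash: str, candidates, m: int = 4) -> bool:
    """Return True when row `s` of file B should be treated as an increment.

    `candidates` is an iterable of (caHash, caMsg) pairs stored at the leaf
    reached for `s_hash`; an empty iterable models the null-pointer case.
    `m` is the number of leading characters kept in caMsg (4 to 8 in the text).
    """
    prefix = s[:m]
    for ca_hash, ca_msg in candidates:
        # Sub-step (3): both the hash and the stored prefix match,
        # so the row is judged to exist in file A.
        if ca_hash == s_hash and ca_msg == prefix:
            return False
    # Sub-step (4): nothing found, or the hash matched but the prefix did not.
    return True
```

Comparing the stored prefix in addition to the hash gives a cheap second check for rows whose hash values happen to coincide.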
Compared with the prior art, the method for rapidly acquiring file increments based on memory operations made with the technical solution of the present invention has the following beneficial effects: the increment operation is fast, about 40 times faster than the existing in-database operation, and the advantage is especially pronounced when acquiring increments over large data volumes; it is accurate and reliable, with no failed or duplicated increment acquisition; memory consumption is low, with 1 GB of memory theoretically supporting an increment comparison of two files of no more than 38.9 million rows; running multiple tasks in parallel does not noticeably reduce performance; and it resolves the various drawbacks of relying on a database to compute increments in the prior art.
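As a rough sanity check on the memory figure, the arithmetic below works out the per-row budget implied by 1 GB and 38.9 million rows, which comes to about 27.6 bytes per row. It assumes that "1G" means 2^30 bytes, that only the rows of file A are held in memory, and that each stored character takes one byte; none of these assumptions is stated in the patent, and the helper name `leaf_slots` is hypothetical.

```python
GIB = 1024 ** 3          # assumption: "1G" means 2**30 bytes
ROWS = 38_900_000        # row count quoted in the text
H, M = 16, 4             # smallest recommended caHash and caMsg sizes

bytes_per_row = GIB / ROWS
payload = H + M          # assumption: one byte per stored character
overhead_budget = bytes_per_row - payload  # left for pointers and trie nodes


def leaf_slots(depth: int, fanout: int = 32) -> int:
    """Number of leaf slots in a full trie with the given fanout and depth."""
    return fanout ** depth


for depth in (4, 5, 6):
    avg_chain = ROWS / leaf_slots(depth)
    print(f"D={depth}: {leaf_slots(depth):,} leaf slots, "
          f"{avg_chain:.2f} rows per slot on average")

print(f"{bytes_per_row:.1f} bytes per row, "
      f"{overhead_budget:.1f} left after a {payload}-byte payload")
```

Under these assumptions a larger D shortens the collision chains but multiplies the number of trie nodes, which is one way to read the advice to pick D from file size and available memory.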
Detailed description
The present invention is described in detail below with reference to the accompanying drawings, as shown in Figures 1 to 6.
Step 1: choose the two files from which the increment is to be obtained. (The old file is hereinafter called file A, the new file is called file B, and the increment file is called file C.)
Step 2: build a 32-way Trie tree of depth D in memory. The specific depth can be chosen sensibly according to file size or hardware environment; a depth of 4 to 6 is recommended.
The data structure of the Trie tree is shown in Figure 1.
Step 3: compute a hash value for each row of file A. According to memory and file size, choose D (4 to 6) positions of the hash value as the Trie tree index, and follow the index one level at a time to find the leaf node pointer of the Trie tree.
The Trie tree is shown in Figure 2 (N = 32).
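One possible way to turn a row's hash into a path through a 32-way trie is sketched below. The patent does not name the hashing algorithm or say how positions of the hash map to child indices, so the use of MD5 and of 5 bits per level (2^5 = 32) are assumptions made for illustration, as are the names `trie_path` and `leaf_slot`.

```python
import hashlib

FANOUT = 32  # "32-way" trie from the text
D = 4        # trie depth; the text recommends 4 to 6


def trie_path(line: str, depth: int = D) -> list:
    """Return `depth` child indices in [0, 32), one per trie level."""
    digest = hashlib.md5(line.encode("utf-8")).digest()
    # 32 bits are enough for up to 6 levels of 5 bits each.
    value = int.from_bytes(digest[:4], "big")
    return [(value >> (5 * level)) & (FANOUT - 1) for level in range(depth)]


def leaf_slot(root: list, path: list):
    """Walk `path`, creating interior nodes on demand.

    Returns (node, index): the last-level node and the slot inside it
    where the text information nodes for this row are attached.
    """
    node = root
    for idx in path[:-1]:
        if node[idx] is None:
            node[idx] = [None] * FANOUT
        node = node[idx]
    return node, path[-1]
```

Each node here is a plain list of 32 slots; rows whose hashes share the same path end up at the same leaf slot.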
Step 4: create a text information node at the leaf node, and store the hash value of the text row and part of its original text in that node. The field caHash stores the hash value; its size H is preferably as long a string as the chosen hashing algorithm produces, with 16 to 24 characters recommended. The field caMsg stores part of the original text; it is suggested to take the first M characters of the original text, with M recommended to be 4 to 8.
The text information node is shown in Figure 3.
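A text information node with the two fields named in step 4 might look like the following sketch. The field names `caHash` and `caMsg` come from the text; the `next` link, the helper `make_node`, and the choice of MD5 as the hashing algorithm are assumptions for illustration only.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

H = 16  # characters of the hash kept in caHash (16 to 24 recommended)
M = 4   # leading characters of the row kept in caMsg (4 to 8 recommended)


@dataclass
class TextInfoNode:
    caHash: str                             # truncated hash of the row
    caMsg: str                              # first M characters of the row
    next: Optional["TextInfoNode"] = None   # hypothetical link for chaining


def make_node(line: str) -> TextInfoNode:
    """Build the node for one row; MD5 is an assumed choice of hash."""
    digest = hashlib.md5(line.encode("utf-8")).hexdigest()  # 32 hex characters
    return TextInfoNode(caHash=digest[:H], caMsg=line[:M])
```

For example, `make_node("2018-05-16,alice,42")` keeps a 16-character hash and the prefix `"2018"`, so the node size is fixed no matter how long the row is.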
Step 5: store the hash value and partial original text of every row of the whole file into text information nodes in turn. If a collision occurs, organize the colliding text information nodes as a linked list. Resolving collisions with a linked list saves some memory. If high operation speed is required and memory is sufficient, the text information nodes can instead be organized as a binary tree in linked representation, at the cost of 10% to 16% more memory.
The collision handling is shown in Figure 4.
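The linked-list form of collision handling can be sketched as below. The function names `chain_insert` and `chain_find` are hypothetical, and inserting at the head of the chain is an assumption; the text only says that colliding nodes are organized as a linked list.

```python
class TextInfoNode:
    """One entry in a leaf's collision chain."""

    def __init__(self, ca_hash, ca_msg, next_node=None):
        self.caHash = ca_hash
        self.caMsg = ca_msg
        self.next = next_node


def chain_insert(head, ca_hash, ca_msg):
    """Prepend a node to the chain and return the new head."""
    return TextInfoNode(ca_hash, ca_msg, head)


def chain_find(head, ca_hash, ca_msg) -> bool:
    """Scan the chain for a node whose hash and stored prefix both match."""
    node = head
    while node is not None:
        if node.caHash == ca_hash and node.caMsg == ca_msg:
            return True
        node = node.next
    return False
```

A chain costs one link per node but is scanned linearly. The binary-tree alternative mentioned in the text would need two links per node, which fits the stated trade of extra memory for faster lookups.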
Step 6: compute the hash value of each row of file B with the same algorithm, and use the same D value (4 to 6) to query nodes in the Trie tree. The node query proceeds as follows:
(1) Read one row record of file B and store it in the string variable S.
(2) Compute the hash value sHash of S and, using the same D value (4 to 6) as the index, query the text information in the Trie tree.
(3) If caHash in a text information node equals sHash, and caMsg in that node equals the first M characters of S, then record row S of file B is judged to exist in file A and is skipped.
(4) If the lookup by sHash in the Trie tree and the text information nodes yields a null pointer, or if caHash equals sHash but caMsg differs from the first M characters of S, then record row S of file B is judged not to exist in file A; this row is increment information.
Step 7: output each string S identified as an increment to file C.
The process flow for loading the old file is shown in Figure 5.
The process flow for obtaining the increment is shown in Figure 6.
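The load and query flows can be sketched end to end as follows. This is a minimal illustration rather than the patented implementation: MD5, the 5-bits-per-level indexing, the tuple-based chain, and the function names `features`, `load_old`, `exists`, and `increment` are all assumptions, and file I/O is replaced by lists of rows to keep the sketch short.

```python
import hashlib

FANOUT, D, H, M = 32, 4, 16, 4  # fanout, depth, caHash size, caMsg size


def features(line):
    """Return (trie path, caHash, caMsg) for one row."""
    md5 = hashlib.md5(line.encode("utf-8"))
    value = int.from_bytes(md5.digest()[:4], "big")
    path = [(value >> (5 * lvl)) & (FANOUT - 1) for lvl in range(D)]
    return path, md5.hexdigest()[:H], line[:M]


def load_old(lines):
    """Load file A: build the trie and attach one chain entry per row."""
    root = [None] * FANOUT
    for line in lines:
        path, ca_hash, ca_msg = features(line)
        node = root
        for idx in path[:-1]:
            if node[idx] is None:
                node[idx] = [None] * FANOUT
            node = node[idx]
        # The leaf slot holds a singly linked chain of (caHash, caMsg, next).
        node[path[-1]] = (ca_hash, ca_msg, node[path[-1]])
    return root


def exists(root, line):
    """Query one row of file B against the trie built from file A."""
    path, s_hash, prefix = features(line)
    node = root
    for idx in path[:-1]:
        node = node[idx]
        if node is None:
            return False  # null pointer on the way down
    entry = node[path[-1]]
    while entry is not None:
        ca_hash, ca_msg, next_entry = entry
        if ca_hash == s_hash and ca_msg == prefix:
            return True
        entry = next_entry
    return False


def increment(old_lines, new_lines):
    """Return the rows of file B judged absent from file A (file C)."""
    root = load_old(old_lines)
    return [s for s in new_lines if not exists(root, s)]
```

For instance, `increment(["1,alice", "2,bob"], ["1,alice", "2,bob", "3,carol"])` returns `["3,carol"]`. When reading real files, rows would need consistent newline handling in both files so that identical records hash identically.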
The present embodiment is characterized in that the increment method comprises: extracting a feature value from each row of the old file and writing it into a specific data structure in memory, then extracting feature values from the new file with the same algorithm and looking them up in memory, covering the feature-value algorithm, the design of the data structure, and collision avoidance. The increment operation is fast, about 40 times faster than the existing in-database operation, and the advantage is especially pronounced when acquiring increments over large data volumes. It is accurate and reliable, with no failed or duplicated increment acquisition. Memory consumption is low, with 1 GB of memory theoretically supporting an increment comparison of two files of no more than 38.9 million rows. Running multiple tasks in parallel does not noticeably reduce performance, and the method resolves the various drawbacks of relying on a database to compute increments in the prior art.
The above technical solution embodies only the preferred technical solution of the present invention. Variations that those skilled in the art may make to some parts of it, which embody the principle of the present invention, fall within the scope of protection of the present invention.
Claims (3)
1. A method for rapidly acquiring file increments based on memory operations, characterized in that the increment method comprises: extracting a feature value from each row of the old file and writing it into a specific data structure in memory, then extracting feature values from the new file with the same algorithm and looking them up in memory, covering the feature-value algorithm, the design of the data structure, and collision avoidance.
2. The method for rapidly acquiring file increments based on memory operations according to claim 1, characterized in that the increment method comprises the following steps:
Step 1: choose the two files from which the increment is to be obtained (the old file is hereinafter called file A, the new file is called file B, and the increment file is called file C);
Step 2: build a 32-way Trie tree of depth D in memory, where the specific depth can be chosen sensibly according to file size or hardware environment and a depth of 4 to 6 is recommended;
Step 3: compute a hash value for each row of file A, choose D (4 to 6) positions of the hash value as the Trie tree index according to memory and file size, and follow the index one level at a time to find the leaf node pointer of the Trie tree;
Step 4: create a text information node at the leaf node, and store the hash value of the text row and part of its original text in that node, where caHash stores the hash value, its size H is preferably as long a string as the chosen hashing algorithm produces, with 16 to 24 characters recommended, and caMsg stores part of the original text, suggested to be the first M characters of the original text, with M recommended to be 4 to 8;
Step 5: store the hash value and partial original text of every row of the whole file into text information nodes in turn; if a collision occurs, organize the colliding text information nodes as a linked list, which saves some memory; if high operation speed is required and memory is sufficient, the text information nodes can instead be organized as a binary tree in linked representation, at the cost of 10% to 16% more memory;
Step 6: compute the hash value of each row of file B with the same algorithm, and use the same D value (4 to 6) to query nodes in the Trie tree;
Step 7: output each string S identified as an increment to file C.
3. The method for rapidly acquiring file increments based on memory operations according to claim 2, characterized in that the node query in step 6 proceeds as follows:
(1) read one row record of file B and store it in the string variable S;
(2) compute the hash value sHash of S and, using the same D value (4 to 6) as the index, query the text information in the Trie tree;
(3) if caHash in a text information node equals sHash, and caMsg in that node equals the first M characters of S, then record row S of file B is judged to exist in file A and is skipped;
(4) if the lookup by sHash in the Trie tree and the text information nodes yields a null pointer, or if caHash equals sHash but caMsg differs from the first M characters of S, then record row S of file B is judged not to exist in file A, and this row is increment information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810465352.2A CN108804542B (en) | 2018-05-16 | 2018-05-16 | Method for rapidly acquiring file increment based on memory operation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810465352.2A CN108804542B (en) | 2018-05-16 | 2018-05-16 | Method for rapidly acquiring file increment based on memory operation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108804542A (en) | 2018-11-13 |
CN108804542B (en) | 2021-12-07 |
Family
ID=64092418
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810465352.2A (Active) CN108804542B (en) | 2018-05-16 | 2018-05-16 | Method for rapidly acquiring file increment based on memory operation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108804542B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101271466A (en) * | 2008-04-30 | 2008-09-24 | 中山大学 | Electronic dictionary work retrieval method based on self-adapting dictionary tree |
CN101482839A (en) * | 2009-02-26 | 2009-07-15 | 北京世纪互联宽带数据中心有限公司 | Electronic document increment memory processing method |
CN101807207A (en) * | 2010-03-22 | 2010-08-18 | 北京大用科技有限责任公司 | Method for sharing document based on content difference comparison |
CN101911060A (en) * | 2007-12-28 | 2010-12-08 | 新叶股份有限公司 | Database index key update method and program |
CN102024020A (en) * | 2010-11-04 | 2011-04-20 | 曙光信息产业(北京)有限公司 | Efficient metadata memory access method in distributed file system |
CN102084363A (en) * | 2008-07-03 | 2011-06-01 | 加利福尼亚大学董事会 | A method for efficiently supporting interactive, fuzzy search on structured data |
US8176018B1 (en) * | 2008-04-30 | 2012-05-08 | Netapp, Inc. | Incremental file system differencing |
CN103714134A (en) * | 2013-12-18 | 2014-04-09 | 中国科学院计算技术研究所 | Network flow data index method and system |
US20170357524A1 (en) * | 2015-01-30 | 2017-12-14 | AppDynamics, Inc. | Automated software configuration management |
Non-Patent Citations (1)
Title |
---|
高翔: "中文短文本去重方法研究", 《计算机工程与应用》 * |
Also Published As
Publication number | Publication date |
---|---|
CN108804542B (en) | 2021-12-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103593436B (en) | file merging method and device | |
CN108255647B (en) | High-speed data backup method under samba server cluster | |
CN103649946B (en) | A kind of method and its system for making file system change synchronous | |
CN103984753B (en) | A kind of web crawlers goes the extracting method and device of multiplex eigenvalue | |
CN109165202A (en) | A kind of preprocess method of multi-source heterogeneous big data | |
CN103150260B (en) | Data de-duplication method and device | |
CA2367181A1 (en) | Method for extracting information from a database | |
CN105989129A (en) | Real-time data statistic method and device | |
CN107506260A (en) | A kind of dynamic division database incremental backup method | |
CN101996250A (en) | Hadoop-based mass stream data storage and query method and system | |
CN105868286A (en) | Parallel adding method and system for merging small files on basis of distributed file system | |
CN103761236A (en) | Incremental frequent pattern increase data mining method | |
CN104809182A (en) | Method for web crawler URL (uniform resource locator) deduplicating based on DSBF (dynamic splitting Bloom Filter) | |
CN109471905A (en) | A kind of block chain index method for supporting time range and range of attributes compound query | |
US20170249218A1 (en) | Data to be backed up in a backup system | |
CN104572679A (en) | Public opinion data storage method and device | |
CN106557777A (en) | It is a kind of to be based on the improved Kmeans clustering methods of SimHash | |
CN104951403B (en) | A kind of cold and hot data identification method of low overhead and zero defect | |
CN102567313B (en) | Progressive webpage library deduplication system and its implementation | |
CN104778164A (en) | Method and device for detecting repeated URL (Uniform Resource Locator) | |
CN104408128B (en) | A kind of reading optimization method indexed based on B+ trees asynchronous refresh | |
CN107423321B (en) | Method and device suitable for cloud storage of large-batch small files | |
CN108182198A (en) | Store the control device and read method of Dynamic matrix control device operation data | |
CN103984723A (en) | Method used for updating data mining for frequent item by incremental data | |
CN108804542A (en) | A kind of quick obtaining file increment method based on memory operation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||