CN108804542B - Method for rapidly acquiring file increment based on memory operation - Google Patents
Method for rapidly acquiring file increment based on memory operation
- Publication number
- CN108804542B
- Authority
- CN
- China
- Prior art keywords
- file
- memory
- node
- hash value
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for quickly acquiring file increment based on memory operation. The invention has the advantages of simple structure and strong practicability.
Description
Technical Field
The invention relates to the fields of G06F17/30, G06F17/00, G06F17 and G06F, in particular to a method for quickly acquiring file increment based on memory operation.
Background
In an ETL process, obtaining incremental data is a key step. The conventional approach loads the new data into a database and strips out the increment there. This consumes a large amount of expensive database resources and runs slowly. In addition, the increment-stripping program in the database must be modified whenever an interface is upgraded, which requires considerable manual intervention and adds development pressure.
Disclosure of Invention
The invention aims to solve the above problems by providing a method for quickly acquiring a file increment based on memory operation.
The technical scheme of the invention is a method for rapidly acquiring a file increment based on memory operation, in which characteristic values are extracted from the old file line by line and written into a specific data structure in memory; characteristic values of the new file are then extracted with the same algorithm and looked up in memory. The characteristic-value algorithm and the design of the data structure are chosen to avoid conflicts.
The incremental method comprises the following steps:
Step one, select the 2 files between which the increment is to be obtained. (Hereinafter the old file is called the A file, the new file the B file, and the increment file the C file.)
Step two, construct in memory a 32-way Trie tree of depth D. The depth can be chosen according to the file size and the hardware environment; the recommended value is 4-6.
Step three, compute a Hash value for each line of the A file, select D (4-6) characters of the Hash value as the Trie tree index according to the memory and file sizes, and follow the index level by level to the leaf node pointer of the Trie tree.
Step four, create a text information node at the leaf node and store in it the Hash value of the line and part of the original text. caHash stores the Hash value; its size H should be as long a string as the Hash algorithm in use allows, with 16-24 characters recommended. caMsg stores part of the original text; the first M characters of the line are recommended, with M between 4 and 8.
Step five, store the Hash values and partial original text of all lines of the whole file into text information nodes in sequence. If a conflict occurs (several lines reach the same leaf), organize the text information nodes as a linked list; the linked list resolves the conflict and saves some memory. If high operation speed is required and memory is sufficient, the text information nodes can instead be organized as a binary tree in linked representation, at the cost of 10-16% more memory.
Step six, compute the Hash value of each line of the B file with the same algorithm and query the nodes in the Trie tree using the same D value (4-6).
Step seven, output each character string S judged to be incremental to the C file.
The node query in step six comprises the following steps:
(1) Read one line record of the B file and store it in a string variable S.
(2) Compute the Hash value sHash of S and, using the same D value (4-6) as the index, look up the text information in the Trie tree.
(3) If caHash in the text information node equals sHash and caMsg in the text information node equals the first M characters of S, the line S of the B file is judged to exist in the A file.
(4) If the lookup with sHash in the Trie tree and the text information nodes yields a null pointer, or caHash in the text information node equals sHash but caMsg differs from the first M characters of S, the line S of the B file is judged not to exist in the A file, and the line is incremental information.
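The seven steps and the four query cases above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the source does not name a Hash algorithm, so SHA-256 with base32 encoding is assumed here because each base32 character then selects one of 32 branches; the function names (`line_hash`, `build_index`, `exists_in_old`, `increment`) and the use of Python lists and tuples for nodes are hypothetical. Python objects also carry far more overhead than the compact nodes the text describes, so the sketch shows the logic only, not the memory footprint.

```python
import base64
import hashlib

D = 4   # trie depth (the text recommends 4-6)
H = 16  # hash characters kept in caHash (16-24 recommended)
M = 4   # leading characters of the line kept in caMsg (4-8 recommended)
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"  # RFC 4648 base32: 32 symbols


def line_hash(line):
    """Hash one line; SHA-256 plus base32 is an assumption of this sketch."""
    digest = hashlib.sha256(line.encode("utf-8")).digest()
    return base64.b32encode(digest).decode("ascii").rstrip("=")


def new_node():
    """One 32-way trie node: a slot per branch, empty (None) at first."""
    return [None] * 32


def build_index(old_lines):
    """Steps two to five: load every line of the A file into the trie."""
    root = new_node()
    for line in old_lines:
        h = line_hash(line)
        node = root
        for ch in h[:D - 1]:                 # walk or create inner levels
            i = ALPHABET.index(ch)
            if node[i] is None:
                node[i] = new_node()
            node = node[i]
        leaf = ALPHABET.index(h[D - 1])      # slot on the last level
        # Text information node as (caHash, caMsg, next); prepending to the
        # existing chain handles lines that reach the same leaf.
        node[leaf] = (h[:H], line[:M], node[leaf])
    return root


def exists_in_old(root, s):
    """Step six: True if line s of the B file is judged to exist in A."""
    s_hash = line_hash(s)
    node = root
    for ch in s_hash[:D - 1]:
        node = node[ALPHABET.index(ch)]
        if node is None:                     # null pointer: s is new
            return False
    entry = node[ALPHABET.index(s_hash[D - 1])]
    while entry is not None:
        ca_hash, ca_msg, next_entry = entry
        if ca_hash == s_hash[:H] and ca_msg == s[:M]:
            return True
        entry = next_entry
    return False


def increment(old_lines, new_lines):
    """Step seven: the lines of B that would be written to the C file."""
    root = build_index(old_lines)
    return [s for s in new_lines if not exists_in_old(root, s)]


print(increment(["1,a", "2,b", "3,c"], ["1,a", "2,B", "3,c", "4,d"]))
# ['2,B', '4,d']
```

Only the A file is held in memory; the B file is streamed through the lookup one line at a time, which is why the memory cost depends on the row count of the old file rather than on both files together.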
Compared with the prior art, the method for rapidly acquiring a file increment based on memory operation provided by the technical scheme of the invention has the following beneficial effects. The increment operation is fast, about 40 times faster than the existing in-database operation, and the advantage is pronounced when acquiring increments of large data volumes. The result is accurate and reliable, avoiding failed or duplicated increment acquisition. Memory consumption is low: in theory, 1 GB of memory supports increment comparison of two files of no more than 38.9 million rows. Running multiple tasks in parallel does not noticeably degrade performance, and the various drawbacks of relying on the database to compute increments in the prior art are overcome.
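The memory claim can be turned into a per-row budget with simple arithmetic. The source does not say how the figure of 38.9 million rows (3890 ten-thousands in the original wording) was derived, so the following is only a back-of-the-envelope check, assuming "1G" means 2**30 bytes; the function name `bytes_per_row` is hypothetical.

```python
GIB = 2 ** 30            # assumption: "1G" of memory means 2**30 bytes
MAX_ROWS = 38_900_000    # 3890 ten-thousands, as quoted in the text


def bytes_per_row(memory_bytes, rows):
    """Average memory available for each indexed row."""
    return memory_bytes / rows


budget = bytes_per_row(GIB, MAX_ROWS)
print(f"{budget:.1f} bytes per row")  # prints "27.6 bytes per row"
```

A budget of roughly 28 bytes per row is of the same order as a node holding a 16-character hash, a 4-character text prefix and one 8-byte pointer, although the source does not confirm that this is how the limit was computed.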
Detailed Description
The invention is described in detail below with reference to the drawings, which are shown in fig. 1-6.
Step one, select the 2 files between which the increment is to be obtained. (Hereinafter the old file is called the A file, the new file the B file, and the increment file the C file.)
Step two, construct in memory a 32-way Trie tree of depth D. The depth can be chosen according to the file size and the hardware environment; the recommended value is 4-6.
The data structure of the Trie tree is shown in fig. 1.
Step three, compute a Hash value for each line of the A file, select D (4-6) characters of the Hash value as the Trie tree index according to the memory and file sizes, and follow the index level by level to the leaf node pointer of the Trie tree.
The Trie tree is shown in fig. 2 (N = 32).
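As an illustration of step three, the sketch below turns the first D characters of a line's hash into D branch numbers in the range 0-31, one per level of the tree. The source does not specify the Hash algorithm or how a hash character maps to a branch; SHA-256 with base32 encoding is assumed here because the base32 alphabet has exactly 32 symbols. The names `line_hash` and `trie_path` are hypothetical.

```python
import base64
import hashlib

BASE32_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"  # RFC 4648, 32 symbols


def line_hash(line):
    """Base32 text of the line's SHA-256 digest (assumed algorithm)."""
    digest = hashlib.sha256(line.encode("utf-8")).digest()
    return base64.b32encode(digest).decode("ascii").rstrip("=")


def trie_path(hash_text, depth):
    """Map the first `depth` hash characters to child indices 0..31."""
    return [BASE32_ALPHABET.index(ch) for ch in hash_text[:depth]]


print(trie_path("AB27", 4))  # [0, 1, 26, 31]
```

With depth D the tree has at most 32**D leaves, so D = 4 gives about one million leaves and D = 6 about one billion; this is one way to read the advice to pick D from the file size and the available memory.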
Step four, create a text information node at the leaf node and store in it the Hash value of the line and part of the original text. caHash stores the Hash value; its size H should be as long a string as the Hash algorithm in use allows, with 16-24 characters recommended. caMsg stores part of the original text; the first M characters of the line are recommended, with M between 4 and 8.
The text information node is shown in fig. 3.
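A text information node as described in step four might look like the following sketch. The field names mirror caHash and caMsg from the text; the `next` field, the class name `TextInfoNode`, the helper `make_node` and the choice of SHA-256 in hexadecimal form are assumptions made for illustration only.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

H = 16  # characters of the hash kept in caHash (16-24 recommended)
M = 4   # leading characters of the line kept in caMsg (4-8 recommended)


@dataclass
class TextInfoNode:
    ca_hash: str                            # first H characters of the hash
    ca_msg: str                             # first M characters of the line
    next: Optional["TextInfoNode"] = None   # link for nodes sharing a leaf


def make_node(line):
    """Build the node for one line; SHA-256 is an assumed algorithm."""
    digest = hashlib.sha256(line.encode("utf-8")).hexdigest()
    return TextInfoNode(ca_hash=digest[:H], ca_msg=line[:M])


node = make_node("1001,Alice,2018-05-16")
print(len(node.ca_hash), node.ca_msg)  # 16 1001
```

Storing a short prefix of the original text next to the hash gives a second, independent check: two different lines must agree on both the stored hash characters and the first M characters before one is mistaken for the other.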
Step five, store the Hash values and partial original text of all lines of the whole file into text information nodes in sequence. If a conflict occurs (several lines reach the same leaf), organize the text information nodes as a linked list; the linked list resolves the conflict and saves some memory. If high operation speed is required and memory is sufficient, the text information nodes can instead be organized as a binary tree in linked representation, at the cost of 10-16% more memory.
This organization is shown in fig. 4.
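The linked-list variant of step five can be sketched as a singly linked chain hanging off one leaf. The class and function names (`Node`, `push`, `find`) are hypothetical, and the hash strings in the usage lines are placeholders rather than real hash output; the binary-tree variant mentioned in the text is not shown.

```python
class Node:
    """Text information node; `next` chains nodes that share one leaf."""

    def __init__(self, ca_hash, ca_msg, next=None):
        self.ca_hash = ca_hash
        self.ca_msg = ca_msg
        self.next = next


def push(head, ca_hash, ca_msg):
    """Prepend a node to the leaf's chain and return the new head."""
    return Node(ca_hash, ca_msg, head)


def find(head, ca_hash, ca_msg):
    """Scan the chain for a node matching both the hash and the prefix."""
    node = head
    while node is not None:
        if node.ca_hash == ca_hash and node.ca_msg == ca_msg:
            return True
        node = node.next
    return False


leaf = None                          # an empty leaf is a null pointer
leaf = push(leaf, "AAAA", "1001")    # placeholder hash text
leaf = push(leaf, "BBBB", "1002")    # second line reaching the same leaf
print(find(leaf, "AAAA", "1001"), find(leaf, "AAAA", "9999"))  # True False
```

Prepending keeps insertion at constant cost, while a lookup walks the chain; a tree-shaped organization shortens long chains but needs an extra pointer per node, which is in line with the extra memory the text attributes to the binary-tree option.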
Step six, compute the Hash value of each line of the B file with the same algorithm and query the nodes in the Trie tree using the same D value (4-6). The node query comprises the following steps:
(1) Read one line record of the B file and store it in a string variable S.
(2) Compute the Hash value sHash of S and, using the same D value (4-6) as the index, look up the text information in the Trie tree.
(3) If caHash in the text information node equals sHash and caMsg in the text information node equals the first M characters of S, the line S of the B file is judged to exist in the A file.
(4) If the lookup with sHash in the Trie tree and the text information nodes yields a null pointer, or caHash in the text information node equals sHash but caMsg differs from the first M characters of S, the line S of the B file is judged not to exist in the A file, and the line is incremental information.
Step seven, output each character string S judged to be incremental to the C file.
The process flow for loading the old file is shown in fig. 5.
The process flow for obtaining the increment is shown in fig. 6.
This embodiment is characterized in that the increment method extracts characteristic values from the old file line by line, writes them into a specific data structure in memory, extracts the characteristic values of the new file with the same algorithm, and looks them up in memory. The characteristic-value algorithm and the design of the data structure avoid conflicts. The increment operation is fast, about 40 times faster than the existing in-database operation, and the advantage is pronounced when acquiring increments of large data volumes. The result is accurate and reliable, avoiding failed or duplicated increment acquisition. Memory consumption is low: in theory, 1 GB of memory supports increment comparison of two files of no more than 38.9 million rows. Running multiple tasks in parallel does not noticeably degrade performance, and the various drawbacks of relying on the database to compute increments in the prior art are overcome.
The technical solutions described above represent only the preferred embodiments of the invention. Modifications that those skilled in the art may make to parts of these solutions embody the principles of the invention and fall within its protection scope.
Claims (2)
1. A method for rapidly acquiring a file increment based on memory operation, characterized in that
the incremental method comprises the following steps:
step one, selecting 2 files between which an increment is to be obtained;
step two, constructing in memory a 32-way Trie tree of depth D, wherein the depth can be chosen according to the file size or the hardware environment;
step three, computing a Hash value for each line of the A file, selecting D characters of the Hash value as the Trie tree index according to the memory and file sizes, and following the index level by level to the leaf node pointer of the Trie tree;
step four, creating a text information node at the leaf node, and storing the Hash value of each line of text and part of the original text into the text information node, wherein caHash stores the Hash value, H being as long a string as the Hash algorithm in use allows, and caMsg stores part of the original text, namely the first M characters of the original text;
step five, storing the Hash values and partial original text of all lines of the whole file into text information nodes in sequence; if a conflict occurs, organizing the text information nodes as a linked list, the linked list resolving the conflict and saving some memory; and if high operation speed is required and memory is sufficient, organizing the text information nodes as a binary tree in linked representation, at the cost of 10-16% more memory;
step six, computing the Hash value of each line of the B file with the same algorithm, and querying the nodes in the Trie tree using the same D value;
and step seven, outputting the character string S to the C file.
2. The method for rapidly acquiring a file increment based on memory operation according to claim 1, wherein the node query in step six comprises the steps of:
(1) reading one line record of the B file and storing it in a string variable S;
(2) computing the Hash value sHash of S and, using the same D value as the index, looking up the text information in the Trie tree;
(3) if caHash in the text information node equals sHash and caMsg in the text information node equals the first M characters of S, judging that the line S of the B file exists in the A file;
(4) if the lookup with sHash in the Trie tree and the text information nodes yields a null pointer, or caHash in the text information node equals sHash but caMsg differs from the first M characters of S, judging that the line S of the B file does not exist in the A file, the line being incremental information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810465352.2A CN108804542B (en) | 2018-05-16 | 2018-05-16 | Method for rapidly acquiring file increment based on memory operation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810465352.2A CN108804542B (en) | 2018-05-16 | 2018-05-16 | Method for rapidly acquiring file increment based on memory operation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108804542A (en) | 2018-11-13 |
CN108804542B (en) | 2021-12-07 |
Family
ID=64092418
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810465352.2A Active CN108804542B (en) | 2018-05-16 | 2018-05-16 | Method for rapidly acquiring file increment based on memory operation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108804542B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101271466A (en) * | 2008-04-30 | 2008-09-24 | 中山大学 | Electronic dictionary work retrieval method based on self-adapting dictionary tree |
CN101482839A (en) * | 2009-02-26 | 2009-07-15 | 北京世纪互联宽带数据中心有限公司 | Electronic document increment memory processing method |
CN101807207A (en) * | 2010-03-22 | 2010-08-18 | 北京大用科技有限责任公司 | Method for sharing document based on content difference comparison |
CN101911060A (en) * | 2007-12-28 | 2010-12-08 | 新叶股份有限公司 | Database index key update method and program |
CN102024020A (en) * | 2010-11-04 | 2011-04-20 | 曙光信息产业(北京)有限公司 | Efficient metadata memory access method in distributed file system |
CN102084363A (en) * | 2008-07-03 | 2011-06-01 | 加利福尼亚大学董事会 | A method for efficiently supporting interactive, fuzzy search on structured data |
US8176018B1 (en) * | 2008-04-30 | 2012-05-08 | Netapp, Inc. | Incremental file system differencing |
CN103714134A (en) * | 2013-12-18 | 2014-04-09 | 中国科学院计算技术研究所 | Network flow data index method and system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9811356B2 (en) * | 2015-01-30 | 2017-11-07 | Appdynamics Llc | Automated software configuration management |
Non-Patent Citations (1)
Title |
---|
"中文短文本去重方法研究";高翔;《计算机工程与应用》;20141231;第192-197页 * |
Also Published As
Publication number | Publication date |
---|---|
CN108804542A (en) | 2018-11-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20180157724A1 (en) | Designating Fields in Machine Data Using Templates | |
CN102024046B (en) | Data repeatability checking method and device as well as system | |
US20160210333A1 (en) | Method and device for mining data regular expression | |
CN105677683A (en) | Batch data query method and device | |
CN112162977B (en) | MES-oriented mass data redundancy removing method and system | |
US9830326B2 (en) | Identifying data offsets using binary masks | |
CN104408128B (en) | A kind of reading optimization method indexed based on B+ trees asynchronous refresh | |
CN111625520A (en) | Universal mapping method and system for field types of heterogeneous database | |
CN108804542B (en) | Method for rapidly acquiring file increment based on memory operation | |
WO2015116762A1 (en) | Optimized data condenser and method | |
CN115982436A (en) | Efficient retrieval and compression system and compression method for stream data | |
JP4921453B2 (en) | Bit string data sorting apparatus, method and program | |
US20150242453A1 (en) | Information processing apparatus, computer-readable recording medium having stored therein data conversion program, and data conversion method | |
US20060218154A1 (en) | Data processing method and data processing program | |
CN115712601A (en) | Method for reading fixed-length files in batch based on springbatch | |
CN115344538A (en) | Log processing method, device and equipment and readable storage medium | |
CN111966686B (en) | Product depth tracing method based on data association model | |
CN110825846B (en) | Data processing method and device | |
CN111752954A (en) | Large-scale feature data storage method and device | |
CN112417815B (en) | Dynamic coding method for class combination data in big data processing | |
CN110609990B (en) | Method and system for editing structured data text based on artificial intelligence | |
JP2010020643A (en) | Data file operation system and its program | |
CN108153813B (en) | Data matching method and system | |
CN117807934A (en) | Character string duplication eliminating method and system based on data characteristics and multi-way tree structure | |
CN114676289A (en) | Processing method, device, terminal and storage medium of prefix tree |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||