CN108804542B - Method for rapidly acquiring file increment based on memory operation - Google Patents

Method for rapidly acquiring file increment based on memory operation

Info

Publication number
CN108804542B
Authority
CN
China
Prior art keywords
file
memory
node
hash value
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810465352.2A
Other languages
Chinese (zh)
Other versions
CN108804542A (en)
Inventor
柴磊
原伟
柳彦利
杨峰
马章焘
王立强
冯剑
付斐
郭峰
刘改琴
李扬
刘晓霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei Godsend High Tech Co ltd
Original Assignee
Hebei Godsend High Tech Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei Godsend High Tech Co ltd
Priority to CN201810465352.2A
Publication of CN108804542A
Application granted
Publication of CN108804542B
Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for quickly acquiring file increment based on memory operation. The invention has the advantages of simple structure and strong practicability.

Description

Method for rapidly acquiring file increment based on memory operation
Technical Field
The invention relates to the fields of G06F17/30, G06F17/00, G06F17 and G06F, in particular to a method for quickly acquiring file increment based on memory operation.
Background
In an ETL process, obtaining incremental data is a key step. The conventional approach loads the new data into a database and then strips out the increment there. This consumes a large amount of expensive database resources and runs slowly. In addition, the increment-stripping program in the database must be modified whenever an interface is upgraded, which requires considerable manual intervention and adds development pressure.
Disclosure of Invention
The invention aims to solve the above problems and provides a method for quickly acquiring file increment based on memory operation.
The technical scheme of the invention is a method for rapidly acquiring file increment based on memory operation, comprising: extracting a characteristic value from each line of the old file; writing the extracted characteristic values into a specific data structure in memory; extracting the characteristic values of the new file with the same algorithm; and looking them up in that in-memory structure, wherein the characteristic-value algorithm and the design of the data structure together avoid collisions.
The incremental method comprises the following steps:
Step one: select the 2 files between which the increment is to be obtained. (Hereinafter the old file is referred to as the A file, the new file as the B file, and the increment file as the C file.)
Step two: construct in memory a 32-way Trie tree of depth D. The specific depth can be chosen according to the file size or the hardware environment; the recommended depth is 4 to 6.
Step three: compute a Hash value for each line of the A file, select D (4 to 6) characters of the Hash value as the Trie tree index according to the memory and file sizes, and follow the index level by level to find the leaf node pointer of the Trie tree.
Step four: create a text information node at the leaf node and store the Hash value of the line and part of its original text in it. caHash stores the Hash value; its length H should preferably be as long as the string produced by the chosen Hash algorithm allows, with 16 to 24 characters recommended. caMsg stores part of the original text; the first M characters are recommended, with M between 4 and 8.
Step five: store the Hash values and partial original text of all lines of the whole file into text information nodes in turn. If a collision occurs, organize the text information nodes as a linked list; the linked list resolves the collision and saves some memory. If high operation speed is required and memory is sufficient, the text information nodes can instead be organized as a binary tree in linked (pointer-based) representation, at the cost of 10-16% more memory.
Step six: compute the Hash value of each line of the B file with the same algorithm, and query the nodes in the Trie tree using the same D value (4 to 6).
Step seven: output each character string S identified as incremental to the C file.
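The seven steps can be sketched end to end as follows. This is a minimal illustration, not the patented implementation: the patent does not name a Hash algorithm or say how Hash characters map to the 32 branches, so the use of SHA-1, base32 encoding, the constants `D`, `H`, `M`, and every function name here are assumptions made for the sketch. A Python list stands in for the linked list of text information nodes.

```python
import base64
import hashlib

# Assumption: base32 gives exactly 32 symbols, one per branch of a 32-way node.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"
D, H, M = 4, 16, 8  # trie depth, caHash length, caMsg length (within the recommended ranges)


def line_hash(line: str) -> str:
    """First H base32 characters of the SHA-1 digest (hypothetical hash choice)."""
    digest = hashlib.sha1(line.encode("utf-8")).digest()
    return base64.b32encode(digest).decode("ascii")[:H]


def new_trie() -> list:
    """An inner node is a list of 32 child slots, filled lazily."""
    return [None] * 32


def insert(root: list, line: str) -> None:
    """Steps three to five: index by D hash characters, store (caHash, caMsg)."""
    ca_hash = line_hash(line)
    node = root
    for ch in ca_hash[:D - 1]:
        idx = ALPHABET.index(ch)
        if node[idx] is None:
            node[idx] = [None] * 32
        node = node[idx]
    idx = ALPHABET.index(ca_hash[D - 1])
    if node[idx] is None:
        node[idx] = []  # leaf slot: chain of text information nodes
    node[idx].append((ca_hash, line[:M]))


def contains(root: list, line: str) -> bool:
    """Step six: look the line up with the same hash and the same D."""
    s_hash = line_hash(line)
    node = root
    for ch in s_hash[:D - 1]:
        node = node[ALPHABET.index(ch)]
        if node is None:
            return False
    chain = node[ALPHABET.index(s_hash[D - 1])]
    if chain is None:
        return False
    return any(h == s_hash and m == line[:M] for h, m in chain)


def delta(old_lines, new_lines) -> list:
    """Steps one to seven: return the lines of B that are not found in A."""
    root = new_trie()
    for line in old_lines:
        insert(root, line)
    return [s for s in new_lines if not contains(root, s)]


if __name__ == "__main__":
    file_a = ["id=1,name=alice", "id=2,name=bob"]
    file_b = ["id=1,name=alice", "id=2,name=robert", "id=3,name=carol"]
    for s in delta(file_a, file_b):
        print(s)  # in a real run these lines would be written to the C file
```

Only the A file is held in memory, and only as a truncated hash plus a short text prefix per line, which is why the structure stays small compared with loading both files into a database.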
The node query in step six comprises the following steps:
(1) Read a line record of the B file and store it in a character string variable S.
(2) Compute the Hash value sHash of S, and use the same D value (4 to 6) as the index to query the text information in the Trie tree.
(3) If the caHash in a text information node is consistent with sHash, and the caMsg in that node is consistent with the first M characters of S, the record line S of the B file is judged to exist in the A file.
(4) If querying the Trie tree and the text information nodes with sHash yields a null pointer, or the caHash in the text information node is consistent with sHash but the caMsg is not consistent with the first M characters of S, the record line S of the B file is determined not to exist in the A file, and the line is incremental information.
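The decision rule in (3) and (4) can be illustrated on a single chain of text information nodes. The sketch below is hedged: `TextNode` and `exists_in_a` are hypothetical names, and walking on to the next node after a mismatch is an assumption of this sketch, since the text only describes the outcome for a null pointer and for a caHash match with a caMsg mismatch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextNode:
    """One text information node: truncated hash, text prefix, next pointer."""
    ca_hash: str
    ca_msg: str
    next: Optional["TextNode"] = None


def exists_in_a(head: Optional[TextNode], s_hash: str, s: str, m: int) -> bool:
    """Return True when some node matches both sHash and the first m characters of S.

    A null head (empty leaf) or a chain with no full match means the line
    is incremental information.
    """
    node = head
    while node is not None:
        if node.ca_hash == s_hash and node.ca_msg == s[:m]:
            return True  # rule (3): the line exists in the A file
        node = node.next  # assumption: keep scanning the collision chain
    return False  # rule (4): the line is incremental


if __name__ == "__main__":
    chain = TextNode("HASHAAAA", "id=1", TextNode("HASHBBBB", "id=2"))
    print(exists_in_a(chain, "HASHBBBB", "id=2,name=bob", 4))
    print(exists_in_a(chain, "HASHAAAA", "id=9,name=eve", 4))
    print(exists_in_a(None, "HASHCCCC", "id=3,name=carol", 4))
```

Comparing the stored text prefix in addition to the hash is what lets two different lines with the same truncated hash be told apart without keeping the full line in memory.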
Compared with the prior art, the method for rapidly acquiring file increment based on memory operation provided by the technical scheme of the invention has the following beneficial effects: the incremental operation is fast, about 40 times faster than the existing in-database operation, and the advantage is pronounced for large data volumes; the result is accurate and reliable, avoiding missed or repeated increments; memory consumption is low, with 1 GB of memory theoretically supporting incremental comparison of two files of no more than 38.9 million line records; performance does not degrade noticeably when multiple tasks run in parallel; and the various drawbacks of relying on the database to compute increments in the prior art are overcome.
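As a rough sanity check on the memory figure, the stated capacity can be turned into a per-line budget and compared with the number of leaf slots a 32-way tree of depth D offers. This is only back-of-the-envelope arithmetic under assumptions the text does not state: "1 GB" is taken as 2^30 bytes, and the budget is spread evenly over all lines. Under those assumptions the budget works out to roughly 27.6 bytes per line.

```python
GIB = 1024 ** 3          # assumption: "1 GB" means 2**30 bytes
ROWS = 38_900_000        # stated capacity: 38.9 million line records

bytes_per_row = GIB / ROWS  # average memory budget per line


def leaf_count(depth: int, fanout: int = 32) -> int:
    """Number of leaf slots in a full tree with the given fanout and depth."""
    return fanout ** depth


if __name__ == "__main__":
    print(f"budget per line: {bytes_per_row:.1f} bytes")
    for depth in (4, 5, 6):  # recommended range for D
        leaves = leaf_count(depth)
        print(f"D={depth}: {leaves} leaves, "
              f"{ROWS / leaves:.3f} lines per leaf on average")
```

The average number of lines per leaf gives a feel for how long the collision chains get at each depth, which is one way to read the advice to choose D according to file size and available memory.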
Detailed Description
The invention is described in detail below with reference to the drawings, which are shown in fig. 1-6.
Step one: select the 2 files between which the increment is to be obtained. (Hereinafter the old file is referred to as the A file, the new file as the B file, and the increment file as the C file.)
Step two: construct in memory a 32-way Trie tree of depth D. The specific depth can be chosen according to the file size or the hardware environment; the recommended depth is 4 to 6.
The data structure of the Trie tree is shown in fig. 1.
Step three: compute a Hash value for each line of the A file, select D (4 to 6) characters of the Hash value as the Trie tree index according to the memory and file sizes, and follow the index level by level to find the leaf node pointer of the Trie tree.
The Trie tree is shown in fig. 2 (N = 32).
Step four: create a text information node at the leaf node and store the Hash value of the line and part of its original text in it. caHash stores the Hash value; its length H should preferably be as long as the string produced by the chosen Hash algorithm allows, with 16 to 24 characters recommended. caMsg stores part of the original text; the first M characters are recommended, with M between 4 and 8.
The text information node is shown in fig. 3.
Step five: store the Hash values and partial original text of all lines of the whole file into text information nodes in turn. If a collision occurs, organize the text information nodes as a linked list; the linked list resolves the collision and saves some memory. If high operation speed is required and memory is sufficient, the text information nodes can instead be organized as a binary tree in linked (pointer-based) representation, at the cost of 10-16% more memory.
This organization is shown in fig. 4.
Step six: compute the Hash value of each line of the B file with the same algorithm, and query the nodes in the Trie tree using the same D value (4 to 6). The node query comprises the following steps:
(1) Read a line record of the B file and store it in a character string variable S.
(2) Compute the Hash value sHash of S, and use the same D value (4 to 6) as the index to query the text information in the Trie tree.
(3) If the caHash in a text information node is consistent with sHash, and the caMsg in that node is consistent with the first M characters of S, the record line S of the B file is judged to exist in the A file.
(4) If querying the Trie tree and the text information nodes with sHash yields a null pointer, or the caHash in the text information node is consistent with sHash but the caMsg is not consistent with the first M characters of S, the record line S of the B file is determined not to exist in the A file, and the line is incremental information.
Step seven: output each character string S identified as incremental to the C file.
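Step five mentions organizing the text information nodes as a binary tree in linked representation when speed matters more than memory. One possible reading is sketched below as an unbalanced binary search tree keyed on the pair (caHash, caMsg). The names `TreeNode`, `bst_insert` and `bst_contains` are hypothetical, and the choice of key ordering is an assumption; the text does not specify either.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Key = Tuple[str, str]  # (caHash, caMsg)


@dataclass
class TreeNode:
    """Text information node with two child pointers instead of one next pointer."""
    key: Key
    left: Optional["TreeNode"] = None
    right: Optional["TreeNode"] = None


def bst_insert(root: Optional[TreeNode], key: Key) -> TreeNode:
    """Insert key below root (iteratively) and return the root of the tree."""
    if root is None:
        return TreeNode(key)
    node = root
    while key != node.key:  # an identical entry is stored only once
        if key < node.key:
            if node.left is None:
                node.left = TreeNode(key)
                break
            node = node.left
        else:
            if node.right is None:
                node.right = TreeNode(key)
                break
            node = node.right
    return root


def bst_contains(root: Optional[TreeNode], key: Key) -> bool:
    """Follow left/right pointers until the key is found or a null pointer is hit."""
    node = root
    while node is not None:
        if key == node.key:
            return True
        node = node.left if key < node.key else node.right
    return False


if __name__ == "__main__":
    leaf = None
    for k in [("HASHMMMM", "id=5"), ("HASHAAAA", "id=1"), ("HASHZZZZ", "id=9")]:
        leaf = bst_insert(leaf, k)
    print(bst_contains(leaf, ("HASHAAAA", "id=1")))
    print(bst_contains(leaf, ("HASHAAAA", "id=2")))
```

Each node carries two pointers rather than one, which is in line with the text's remark that this layout costs extra memory in exchange for faster lookups within a crowded leaf.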
The process flow for loading the old file is shown in fig. 5:
The process flow for obtaining the increment is shown in fig. 6:
The embodiment is characterized in that the incremental method extracts a characteristic value from each line of the old file, writes the extracted characteristic values into a specific data structure in memory, extracts the characteristic values of the new file with the same algorithm, and looks them up in that in-memory structure, wherein the characteristic-value algorithm and the design of the data structure together avoid collisions. The incremental operation is fast, about 40 times faster than the existing in-database operation, and the advantage is pronounced for large data volumes. The result is accurate and reliable, avoiding missed or repeated increments. Memory consumption is low: theoretically, 1 GB of memory supports incremental comparison of two files of no more than 38.9 million line records. Performance does not degrade noticeably when multiple tasks run in parallel, and the various drawbacks of relying on the database to compute increments in the prior art are overcome.
The technical solutions described above represent only the preferred technical solutions of the present invention. Modifications that those skilled in the art may make to some parts of them all embody the principles of the present invention and fall within its protection scope.

Claims (2)

1. A method for rapidly acquiring file increment based on memory operation, characterized in that
the incremental method comprises the following steps:
step one, selecting 2 files between which an increment is to be obtained, the old file being the A file, the new file being the B file, and the increment file being the C file;
step two, constructing in memory a 32-way Trie tree of depth D, wherein the specific depth can be chosen according to the file size or the hardware environment;
step three, computing a Hash value for each line of the A file, selecting D characters of the Hash value as the Trie tree index according to the memory and file sizes, and following the index level by level to find the leaf node pointer of the Trie tree;
step four, creating a text information node at the leaf node, and storing the Hash value of each line of text and part of its original text in the text information node, wherein caHash stores the Hash value, its length H being as long as the string produced by the chosen Hash algorithm allows, and caMsg stores part of the original text, namely its first M characters;
step five, storing the Hash values and partial original text of all lines of the whole file into text information nodes in turn; if a collision occurs, organizing the text information nodes as a linked list, the linked list resolving the collision and saving some memory; and if high operation speed is required and memory is sufficient, organizing the text information nodes as a binary tree in linked representation, at the cost of 10-16% more memory;
step six, computing the Hash value of each line of the B file with the same algorithm, and querying the nodes in the Trie tree using the same D value;
and step seven, outputting the character string S to the C file.
2. The method for rapidly acquiring file increment based on memory operation according to claim 1, wherein the node query in step six comprises the following steps:
(1) reading a line record of the B file and storing it in a character string variable S;
(2) computing the Hash value sHash of S, and using the same D value as the index to query the text information in the Trie tree;
(3) if the caHash in a text information node is consistent with sHash and the caMsg in that node is consistent with the first M characters of S, judging that the record line S of the B file exists in the A file;
(4) if querying the Trie tree and the text information nodes with sHash yields a null pointer, or the caHash in the text information node is consistent with sHash but the caMsg is not consistent with the first M characters of S, determining that the record line S of the B file does not exist in the A file, and that the line is incremental information.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810465352.2A CN108804542B (en) 2018-05-16 2018-05-16 Method for rapidly acquiring file increment based on memory operation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810465352.2A CN108804542B (en) 2018-05-16 2018-05-16 Method for rapidly acquiring file increment based on memory operation

Publications (2)

Publication Number Publication Date
CN108804542A (en) 2018-11-13
CN108804542B (en) 2021-12-07

Family

ID=64092418

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810465352.2A Active CN108804542B (en) 2018-05-16 2018-05-16 Method for rapidly acquiring file increment based on memory operation

Country Status (1)

Country Link
CN (1) CN108804542B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101271466A (en) * 2008-04-30 2008-09-24 中山大学 Electronic dictionary work retrieval method based on self-adapting dictionary tree
CN101482839A (en) * 2009-02-26 2009-07-15 北京世纪互联宽带数据中心有限公司 Electronic document increment memory processing method
CN101807207A (en) * 2010-03-22 2010-08-18 北京大用科技有限责任公司 Method for sharing document based on content difference comparison
CN101911060A (en) * 2007-12-28 2010-12-08 新叶股份有限公司 Database index key update method and program
CN102024020A (en) * 2010-11-04 2011-04-20 曙光信息产业(北京)有限公司 Efficient metadata memory access method in distributed file system
CN102084363A (en) * 2008-07-03 2011-06-01 加利福尼亚大学董事会 A method for efficiently supporting interactive, fuzzy search on structured data
US8176018B1 (en) * 2008-04-30 2012-05-08 Netapp, Inc. Incremental file system differencing
CN103714134A (en) * 2013-12-18 2014-04-09 中国科学院计算技术研究所 Network flow data index method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9811356B2 (en) * 2015-01-30 2017-11-07 Appdynamics Llc Automated software configuration management

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101911060A (en) * 2007-12-28 2010-12-08 新叶股份有限公司 Database index key update method and program
CN101271466A (en) * 2008-04-30 2008-09-24 中山大学 Electronic dictionary work retrieval method based on self-adapting dictionary tree
US8176018B1 (en) * 2008-04-30 2012-05-08 Netapp, Inc. Incremental file system differencing
CN102084363A (en) * 2008-07-03 2011-06-01 加利福尼亚大学董事会 A method for efficiently supporting interactive, fuzzy search on structured data
CN101482839A (en) * 2009-02-26 2009-07-15 北京世纪互联宽带数据中心有限公司 Electronic document increment memory processing method
CN101807207A (en) * 2010-03-22 2010-08-18 北京大用科技有限责任公司 Method for sharing document based on content difference comparison
CN102024020A (en) * 2010-11-04 2011-04-20 曙光信息产业(北京)有限公司 Efficient metadata memory access method in distributed file system
CN103714134A (en) * 2013-12-18 2014-04-09 中国科学院计算技术研究所 Network flow data index method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
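Title
"Research on deduplication methods for Chinese short texts" ("中文短文本去重方法研究"); 高翔 (Gao Xiang); Computer Engineering and Applications (《计算机工程与应用》); 2014-12-31; pp. 192-197 *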

Also Published As

Publication number Publication date
CN108804542A (en) 2018-11-13

Similar Documents

Publication Publication Date Title
US20180157724A1 (en) Designating Fields in Machine Data Using Templates
CN102024046B (en) Data repeatability checking method and device as well as system
US20160210333A1 (en) Method and device for mining data regular expression
CN105677683A (en) Batch data query method and device
CN112162977B (en) MES-oriented mass data redundancy removing method and system
US9830326B2 (en) Identifying data offsets using binary masks
CN104408128B (en) A kind of reading optimization method indexed based on B+ trees asynchronous refresh
CN111625520A (en) Universal mapping method and system for field types of heterogeneous database
CN108804542B (en) Method for rapidly acquiring file increment based on memory operation
WO2015116762A1 (en) Optimized data condenser and method
CN115982436A (en) Efficient retrieval and compression system and compression method for stream data
JP4921453B2 (en) Bit string data sorting apparatus, method and program
US20150242453A1 (en) Information processing apparatus, computer-readable recording medium having stored therein data conversion program, and data conversion method
US20060218154A1 (en) Data processing method and data processing program
CN115712601A (en) Method for reading fixed-length files in batch based on springbatch
CN115344538A (en) Log processing method, device and equipment and readable storage medium
CN111966686B (en) Product depth tracing method based on data association model
CN110825846B (en) Data processing method and device
CN111752954A (en) Large-scale feature data storage method and device
CN112417815B (en) Dynamic coding method for class combination data in big data processing
CN110609990B (en) Method and system for editing structured data text based on artificial intelligence
JP2010020643A (en) Data file operation system and its program
CN108153813B (en) Data matching method and system
CN117807934A (en) Character string duplication eliminating method and system based on data characteristics and multi-way tree structure
CN114676289A (en) Processing method, device, terminal and storage medium of prefix tree

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant