CN117971826A - Construction method and construction device of large data linked list with verification function - Google Patents


Info

Publication number
CN117971826A
Authority
CN
China
Prior art keywords
suffix
type
suffixes
bucket
blocks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410036277.3A
Other languages
Chinese (zh)
Inventor
韩凌波
冯卓文
李晓玉
冯天心
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Ocean University
Original Assignee
Guangdong Ocean University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date


Abstract

The invention discloses a method and a device for constructing a large data linked list with a verification function, relating to the technical field of computers. The method comprises: partitioning the character string and its suffix array into blocks according to the size of the device memory; recursively applying external-memory block induced sorting to sort the S-type suffixes, and computing the ascending and descending fingerprint values of the S-type suffixes; using the same sorting method, sorting the L-type suffixes into their suffix bucket blocks according to the order of the S-type suffixes and storing them in external memory; sorting the S-type suffixes into their suffix bucket blocks according to the order of the L-type suffixes, and computing the fingerprint value of the descending S-type suffixes; fetching each suffix from the L-type and S-type suffix bucket blocks in lexicographic order by multi-way merge sorting, computing each suffix's pointer field and suffix linked list block, and computing the fingerprint value of the ascending S-type suffixes; and sorting and merging the suffix linked list blocks by position to generate the final linked list, then comparing the ascending and descending fingerprint values to give a verification result.

Description

Construction method and construction device of large data linked list with verification function
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for constructing a large data linked list with a verification function.
Background
A suffix linked list is a singly linked list formed by linking the sorted suffixes by position, and is mainly used in data compression. Given a character string T of length n, the corresponding suffix linked list ψ(T) is an integer array of length n+1. The first element ψ[0] is the list head and points to the smallest suffix of T; every other element points to the position of the next suffix in lexicographic order after the current one, so the whole suffix linked list can be traversed starting from ψ[0]. The correctness of a large data suffix linked list directly determines the correctness and efficiency of the compression process, so verifying it is an indispensable step in large data compression.
Existing techniques for verifying suffix linked lists mainly target small data and comprise two methods: (1) traversing the whole linked list in memory and comparing adjacent suffixes one pair at a time; (2) integrating verification into the construction of the linked list, so that computation and verification proceed together. However, for large data that exceeds the device memory, the character string T and its linked list are both stored in external memory, and neither method is applicable. In theory, adjacent suffixes could be read from T by following the pointers of the suffix linked list and compared one by one, but this requires a large number of random reads from external memory, which is very time-consuming for large-scale data and makes the computation inefficient.
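The first of the two small-data methods (traversing the list and comparing adjacent suffixes) can be sketched in Python as follows. This is an in-memory illustration only: the function name `verify_naive` is hypothetical, positions are 1-based as in ψ(T), and each comparison touches T at an arbitrary position, which is exactly the random access pattern that becomes expensive once T lives in external memory.

```python
def verify_naive(t, psi):
    """Walk the suffix linked list from the head psi[0] and check that
    each suffix is strictly smaller than the one it points to."""
    prev, cur, seen = None, psi[0], 0
    while cur != 0:                       # 0 is the list head
        if prev is not None and not t[prev - 1:] < t[cur - 1:]:
            return False                  # adjacent suffixes out of order
        prev, cur, seen = cur, psi[cur], seen + 1
        if seen > len(t):                 # guard against a cycle that skips the head
            return False
    return seen == len(t)                 # every suffix visited exactly once
```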
Disclosure of Invention
To address the problems in the prior art, the invention provides a method and a device for constructing a large data linked list with a verification function, which verify the correctness of the linked list while it is being constructed and give the verification result when the linked list is generated. No separate verification pass is needed, which improves the efficiency of verifying a large data linked list.
In order to achieve the above object, the present invention provides the following technical solutions. In one aspect, the invention provides a method for constructing a large data linked list with a verification function, comprising the following steps:
Reading the character string T from external memory from right to left, locating the S-type characters by comparing the lexicographic order of adjacent characters, and logically dividing T into a plurality of blocks, using S-type characters as separators, according to the size of the device memory;
scanning the character string T, counting the number of elements in each suffix bucket, and grouping the suffix buckets into suffix bucket blocks according to the size of the device memory, such that each block can be induced-sorted in memory; a suffix bucket block consists of consecutive suffix buckets and is stored in an external-memory data block; a suffix bucket consists of the suffixes in the suffix array that share the same first character;
Recursively applying external-memory block induced sorting to compute the order of the S-type suffixes of T, and iteratively computing the fingerprint values fp1 and fp2 of the ascending and descending S-type suffixes with a fingerprint function;
according to the sorted S-type suffixes, computing the preceding characters of the L-type and S-type suffixes of each block of T, and storing them in order in the L-type and S-type preceding blocks corresponding to that block; the L-type and S-type preceding blocks consist of the preceding characters of the L-type and S-type suffixes, respectively, and are stored in external-memory data blocks;
traversing the S-type suffixes and the L-type preceding blocks and, using a min-heap keyed on the suffix's first character and sequence number, sorting the L-type suffixes into the L-type suffix bucket blocks to which they belong and storing them in order in external memory;
Traversing the L-type suffixes and the S-type preceding blocks in descending order and, using a max-heap keyed on the suffix's first character and sequence number, sorting the S-type suffixes into the S-type suffix bucket blocks to which they belong, while iteratively computing the fingerprint value fp3 of the descending S-type suffixes with a fingerprint function;
Fetching the suffixes from the L-type and S-type suffix bucket blocks in lexicographic order by multi-way merge sorting, computing each suffix's pointer field and the suffix linked list block to which it belongs from the positions of adjacent suffixes, sorting each suffix linked list block by position, and merging the suffix linked list blocks to generate the final linked list, while iteratively computing the fingerprint value fp4 of the ascending S-type suffixes with a fingerprint function;
Comparing the ascending and descending fingerprint values of the S-type suffixes and giving the correctness verification result for the generated suffix linked list.
In another aspect, the invention provides a device for constructing a large data linked list with a verification function, comprising: a character string preprocessing module, which divides the character string T into a plurality of blocks and divides the suffix buckets into a plurality of suffix bucket blocks; an S-type suffix sorting module, which sorts the S-type suffixes of T and computes the fingerprint values of the ascending and descending S-type suffixes; a suffix preceding-character calculation module, which computes the preceding characters of the L-type and S-type suffixes of each block of T and stores them in the corresponding preceding blocks in external memory; an L-type suffix sorting module, which computes the order of the L-type suffixes from the order of the S-type suffixes and stores the L-type suffixes in external memory; an S-type suffix sorting module, which computes the order of the S-type suffixes from the order of the L-type suffixes and computes the fingerprint value of the descending S-type suffixes; a suffix linking module, which links the suffixes in the L-type and S-type suffix bucket blocks to generate the final suffix linked list and computes the fingerprint value of the ascending S-type suffixes; and a verification module, which verifies the correctness of the suffix linked list by comparing the fingerprint values of the ascending and descending suffixes.
In yet another aspect, the invention provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for constructing a large data linked list with a verification function according to the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
The electronic device provided by the invention first sorts the S-type suffixes and computes their ascending and descending fingerprint values, then induces the order of the L-type and S-type suffixes from the sorted S-type suffixes, merges the two to generate the final linked list while computing the fingerprint values of the ascending and descending S-type suffixes, and finally verifies the correctness of the linked list by comparing the fingerprint values. In application scenarios such as computing the LZ77 factorization of large data, the construction method provided by the embodiments integrates correctness verification into the construction of the linked list, so that no separate verification is needed and the time cost of verification is significantly reduced.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the following will briefly introduce the drawings that are required to be used in the embodiments or the description of the prior art. It is evident that the drawings in the following description are only some embodiments of the present application and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
Fig. 1 is a schematic diagram of a suffix array and a suffix linked list storage structure of a character string T according to an embodiment of the present application;
Fig. 2 is a schematic diagram of the suffix buckets and suffix bucket blocks of a character string T according to an embodiment of the present application;
Fig. 3 is a schematic diagram of the suffix linked list blocks of a character string T according to an embodiment of the present application;
Fig. 4 is a flowchart of a method for constructing a large data linked list with a verification function according to an embodiment of the present application;
Fig. 5 is a schematic diagram of a device for constructing a large data linked list with a verification function according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
First, technical terms possibly used in the embodiments of the present application are described herein in a unified manner.
Character string T: consists of n characters, i.e., T[1, n] = T[1]T[2]...T[n], where T[n] = '$' is the unique smallest character; the substring T[i, n] starting at position i is the suffix of T starting at i, denoted suf(T, i).
Character and suffix types: L-type and S-type. If T[i] > T[i+1], or T[i] == T[i+1] and T[i+1] is L-type, then T[i] is L-type; otherwise it is S-type. If T[i] is L-type and T[i-1] is S-type, T[i] is L*-type; if T[i] is S-type and T[i-1] is L-type, T[i] is S*-type; the L* and S* types are subtypes of the L and S types, respectively. The type of a suffix is the same as the type of its first character.
S*-type substring: the substring of T delimited by two adjacent S*-type characters, including the S*-type characters at both ends. Within an S*-type substring, a substring running from any character other than the first to the final S*-type character is called an L-type or S-type substring, according to the type of its first character.
Suffix array SA(T): an integer array of the same length as T that stores, from left to right, the starting positions of the suffixes of T in ascending lexicographic order.
Suffix linked list ψ(T): an integer array of length n+1. ψ[0] is the list head and points to the smallest suffix of T. If i ≠ n, then ψ[SA[i]] = SA[i+1]; otherwise ψ[SA[n]] = 0, i.e., the largest suffix suf(T, SA[n]) points back to the list head ψ[0].
Suffix preceding character: the preceding character of suf(T, i) in T is T[i-1], and its type is the type of T[i-1].
Suffix bucket and suffix bucket block: the suffixes in SA(T) that share the same first character form a suffix bucket, and several consecutive suffix buckets form a suffix bucket block; within a suffix bucket, the left part is the L-type suffix bucket and the right part is the S-type suffix bucket.
Suffix linked list block: the integer array of the suffix linked list psi (T) is split into a plurality of blocks of fixed length.
Karp-Rabin fingerprint function: computes the fingerprint value of a character string or sequence. The fingerprint value of a substring or subsequence of any length can be computed iteratively with formulas (1-1) to (1-3), where FP(a, b) denotes the fingerprint value of the elements from a to b, and the parameters δ and L are prime numbers with δ < L. The larger δ and L are, the lower the error probability. In the formulas below, seq is the string or sequence being fingerprinted and seq[p] is its p-th element.
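$\mathrm{FP}(0,-1) = 0$ (1-1)
$\mathrm{FP}(0,p) = \big(\mathrm{FP}(0,p-1)\cdot\delta + \mathrm{seq}[p]\big) \bmod L$ (1-2)
$\mathrm{FP}(p,q) = \big(\mathrm{FP}(0,q) - \mathrm{FP}(0,p-1)\cdot\delta^{\,q-p+1}\big) \bmod L$ (1-3)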
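Formulas (1-1) to (1-3) can be sketched in Python as below. The constants `DELTA` and `MOD` are illustrative assumptions (the text only requires two primes with δ < L), the function names are hypothetical, and indices are 0-based.

```python
DELTA = 131              # assumed prime base (delta in the text)
MOD = 1_000_000_007      # assumed prime modulus (L in the text), DELTA < MOD

def prefix_fingerprints(seq):
    """Return fp with fp[p + 1] = FP(0, p); fp[0] = FP(0, -1) = 0 (1-1)."""
    fp = [0]
    for x in seq:
        fp.append((fp[-1] * DELTA + x) % MOD)          # formula (1-2)
    return fp

def range_fingerprint(fp, p, q):
    """FP(p, q) for 0 <= p <= q, derived from prefix values (1-3)."""
    return (fp[q + 1] - fp[p] * pow(DELTA, q - p + 1, MOD)) % MOD
```

The range fingerprint of a slice equals the fingerprint obtained by hashing that slice from scratch, which is what allows fingerprints of different traversals to be compared.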
Taking the character string T[1,12] = "hhaappyyaap$" as an example, each of the above terms is explained below.
Referring to Fig. 1, the character types of the character string T and the storage structures of SA(T) and ψ(T) are shown.
Line 3 of Fig. 1 shows the character types of T. For example, T[11] = 'p' is lexicographically greater than T[12] = '$', so T[11] is L-type, and because T[10] is S-type, T[11] is also L*-type (an L-type character preceded by an S-type character); T[09] = 'a' equals T[10] = 'a' and T[10] is S-type, so T[09] is also S-type, and because T[08] is L-type, T[09] is also S*-type (an S-type character preceded by an L-type character). The character string T contains 2 S*-type substrings, T[03,09] and T[09,12]; within T[03,09], T[06,09] and T[07,09] are an S-type and an L-type substring, respectively.
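The type rules can be checked with a short Python sketch. The function name `classify` is hypothetical, indices are 0-based inside the code, and the starred subtypes are written `L*` and `S*` (a notational assumption, consistent with the LStar sequence named later in the description).

```python
def classify(t):
    """Return the type ('L', 'S', 'L*' or 'S*') of every character of t."""
    n = len(t)
    base = ['S'] * n                          # the final sentinel is S-type
    for i in range(n - 2, -1, -1):            # scan from right to left
        if t[i] > t[i + 1] or (t[i] == t[i + 1] and base[i + 1] == 'L'):
            base[i] = 'L'
    out = list(base)
    for i in range(1, n):
        if base[i] != base[i - 1]:            # type differs from the preceding character
            out[i] = base[i] + '*'
    return out
```

For "hhaappyyaap$" the starred positions (1-based) are 3, 9 and 12 for S* and 7 and 11 for L*.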
Line 4 of Fig. 1 shows the storage structure of SA(T), an integer array of length 12 in which suf(T, SA[i]) is lexicographically smaller than suf(T, SA[j]) for 1 ≤ i < j ≤ 12. For example, suf(T, SA[2] = 09) is smaller than suf(T, SA[12] = 07), i.e., the suffix T[09,12] = 'aap$' is lexicographically smaller than the suffix T[07,12] = 'yyaap$'.
Line 5 of Fig. 1 shows the storage structure of ψ(T), where ψ[0] is the list head and points to the smallest suffix suf(T, SA[01] = 12) of T; ψ[SA[01] = 12] points to the suffix suf(T, SA[02] = 09), ψ[SA[02] = 09] points to the suffix suf(T, SA[03] = 03), ψ[SA[03] = 03] points to the suffix suf(T, SA[04] = 10), and so on, until ψ[SA[12] = 07] points back to the list head.
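For a string this small, SA(T) and ψ(T) can be built directly from their definitions. The Python sketch below sorts the suffixes naively in memory, so it only illustrates the data structures and is not the external-memory construction; the function names are hypothetical and positions are 1-based.

```python
def suffix_array(t):
    """1-based starting positions of the suffixes of t in ascending order."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])

def suffix_linked_list(sa):
    """psi[0] is the head; psi[SA[i]] = SA[i+1]; the largest suffix points to 0."""
    psi = [0] * (len(sa) + 1)
    psi[0] = sa[0]
    for cur, nxt in zip(sa, sa[1:]):
        psi[cur] = nxt
    psi[sa[-1]] = 0
    return psi
```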
Referring to fig. 2, a suffix bucket and suffix bucket block storage structure of a character string T is shown.
Line 3 of Fig. 2 shows the suffix buckets of SA(T). There are 5 suffix buckets: the '$' bucket at SA[01], the 'a' bucket at SA[02,05], the 'h' bucket at SA[06,07], the 'p' bucket at SA[08,10], and the 'y' bucket at SA[11,12]. Within the 'p' suffix bucket, SA[08] and SA[09,10] are the L-type and S-type suffix buckets of 'p', respectively.
Line 4 of Fig. 2 shows the suffix bucket blocks of SA(T). There are 2 suffix bucket blocks: the block at SA[01,05], consisting of the '$' and 'a' suffix buckets, and the block at SA[06,12], consisting of the 'h', 'p' and 'y' suffix buckets.
Referring to Fig. 3, the storage structure obtained by dividing the suffix linked list ψ(T) of the character string T into suffix linked list blocks is shown. Line 3 divides the array ψ(T) into linked list blocks of length 3; the 0th to 3rd suffix linked list blocks of ψ(T) are, in order, ψ[01,03], ψ[04,06], ψ[07,09] and ψ[10,12].
The specific computing framework of the embodiment of the application is as follows: firstly, calculating the sequence of S-type suffixes of a character string T, and simultaneously calculating fingerprint values of the S-type suffixes in ascending and descending order; then, according to the sequence of the S type suffixes, calculating the sequence of the L type suffixes and the S type suffixes, and simultaneously calculating fingerprints of the descending S type suffixes; and finally, calculating the link information of the L and S type suffixes, combining the two types of suffixes to generate a final linked list, calculating the fingerprint value of the ascending S type suffixes, and comparing the ascending and descending fingerprint values to give a final verification result.
The technical scheme of the application is described below through specific examples.
Referring to Fig. 4, a flowchart of the steps of a method for constructing a large data linked list with a verification function according to an embodiment of the present application is shown; the method may specifically include the following steps:
S401, reading the character string T from external memory from right to left, comparing the lexicographic order of adjacent characters to find the S-type characters, and logically dividing T into a plurality of blocks at S-type characters according to the memory size;
In the embodiment of the application, the block size of the character string T is determined by the memory size of the device, on the following principle: each block can be induced-sorted in device memory, unless the block consists of a single S-type substring that itself exceeds the memory size.
In the embodiment of the present application, logically dividing the character string T into a plurality of blocks includes: (1) determining whether the character currently read is S-type; if not, reading the next character, otherwise proceeding to the next step; (2) computing the length of the S-type substring that starts at the current S-type character; if the length of the current block plus the length of this S-type substring exceeds the device memory, the current block is closed and a new block of T is started; otherwise the current S-type substring is added to the current block. These steps are repeated until the whole string has been read.
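The blocking rule can be sketched in Python under several stated assumptions: the separators are taken to be the S-type characters that follow an L-type character, adjacent blocks share their boundary character, a single oversized substring is allowed to form a block on its own, and the characters before the first separator join the leftmost block. The names `separator_positions`, `split_blocks` and `capacity` are hypothetical; indices are 0-based and block bounds are inclusive.

```python
def separator_positions(t):
    """0-based positions of S-type characters preceded by an L-type character."""
    n = len(t)
    is_s = [True] * n
    for i in range(n - 2, -1, -1):                    # right-to-left type scan
        is_s[i] = t[i] < t[i + 1] or (t[i] == t[i + 1] and is_s[i + 1])
    return [i for i in range(1, n) if is_s[i] and not is_s[i - 1]]

def split_blocks(t, capacity):
    """Greedily group separator-delimited substrings, right to left."""
    cuts = separator_positions(t)
    blocks, start, end = [], cuts[-1], cuts[-1]       # current block is empty
    for p in reversed(cuts[:-1]):
        if start != end and end - p + 1 > capacity:   # substring does not fit
            blocks.append((start, end))               # close the current block
            end = start                               # next block ends at the shared separator
        start = p
    blocks.append((0, end))                           # leading characters join the last block
    return blocks[::-1]
```

With a capacity of 8 characters, "hhaappyyaap$" splits into the index ranges (0, 8) and (8, 11); with a capacity of 12 it stays in one block.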
S402, scanning the character string T, counting the number of elements in each suffix bucket, and combining the suffix buckets into a plurality of suffix bucket blocks according to the memory size of the device;
In the embodiment of the present application, combining the suffix buckets into suffix bucket blocks includes: (1) reading T once and counting the number of elements in each suffix bucket with an integer array of length 256; (2) accumulating the bucket sizes from left to right; if the size of the current suffix bucket block plus the current suffix bucket exceeds the device memory, the current suffix bucket block is closed and a new one is started; otherwise the current suffix bucket is added to the current suffix bucket block. The accumulation continues to the right until the last suffix bucket.
In the embodiment of the application, the character string T is divided into blocks and the suffix buckets are grouped into suffix bucket blocks mainly so that in-memory induced sorting can be simulated block by block, realizing external-memory block induced sorting for large-scale data.
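The two steps above can be sketched in Python as follows. The sketch uses `collections.Counter` instead of the 256-entry integer array, and the names `bucket_blocks` and `capacity` (the number of suffixes that fit in memory) are hypothetical.

```python
from collections import Counter

def bucket_blocks(t, capacity):
    """Greedily group consecutive suffix buckets into blocks holding
    at most `capacity` suffixes (a single larger bucket stays alone)."""
    counts = Counter(t)                      # bucket size = occurrences of the first character
    blocks, current, size = [], [], 0
    for ch in sorted(counts):                # buckets from left to right
        if current and size + counts[ch] > capacity:
            blocks.append(current)           # close the current bucket block
            current, size = [], 0
        current.append(ch)
        size += counts[ch]
    if current:
        blocks.append(current)
    return blocks
```

For "hhaappyyaap$" the bucket sizes are 1, 4, 2, 3 and 2, so a capacity of 5 gives the blocks ['$', 'a'], ['h', 'p'] and ['y'].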
S403, recursively invoking external-memory block induced sorting to sort the S-type suffixes, traversing the S-type suffixes in ascending and in descending order, and iteratively computing the ascending and descending fingerprint values with a fingerprint function;
In the embodiment of the present application, recursively invoking external-memory block induced sorting to sort the S-type suffixes includes: (1) problem reduction stage: scanning T to find the S-type substrings, then sorting and naming them by external-memory block induced sorting to generate a new character string T1; (2) recursion test stage: if the characters of T1 are not all unique, recursing on T1; (3) induced solving stage: sorting the L-type and then the S-type suffixes by external-memory block induced sorting, starting from the sorted S-type suffixes. When the recursion returns to the induced solving stage of level 0, the ascending or descending order of the S-type suffixes of T can be computed from the suffix array SA(T1) of T1.
In the embodiment of the present application, external-memory block induced sorting includes: (1) dividing T into a plurality of blocks and dividing the suffix array SA(T) into a plurality of suffix bucket blocks; (2) computing the L-type and S-type preceding blocks corresponding to each block of T; (3) for each L-type suffix bucket block and its preceding blocks, using a min-heap keyed on the suffix's first character and sequence number to sort the L-type suffixes into the corresponding L-type suffix bucket blocks; (4) for each S-type suffix bucket block and its preceding blocks, using a max-heap keyed on the suffix's first character and sequence number to sort the S-type suffixes into the corresponding S-type suffix bucket blocks; (5) using multi-way merge sorting to fetch the suffixes one by one from the L-type and S-type suffix bucket blocks in ascending or descending order.
In the embodiment of the present application, the fingerprint values of the ascending and descending S-type suffixes are computed as follows: the S-type suffixes are traversed in ascending and in descending order, and the fingerprint values fp1 (ascending) and fp2 (descending) are iteratively computed with the Karp-Rabin fingerprint function, formulas (1-1) to (1-3).
S404, computing, block by block, the L-type and S-type preceding blocks corresponding to each block of T according to the sorted S-type suffixes;
In the embodiment of the present application, the L-type and S-type preceding blocks are computed by performing the following steps for each block of T in turn: (1) reading the i-th block T_i of T into memory and initializing the S-type suffix buckets with the ascending S-type suffixes that belong to T_i; (2) traversing the suffix buckets from left to right and, by in-memory induced sorting, writing the L-type preceding characters of the traversed suffixes in order to the L-type preceding block of the current block T_i; (3) traversing the suffix buckets from right to left and, by in-memory induced sorting, writing the S-type preceding characters of the traversed suffixes in order to the S-type preceding block of the current block T_i.
S405, traversing the S-type suffixes and the L-type preceding blocks, sorting the L-type suffixes into the corresponding L-type suffix bucket blocks, and storing the L-type suffixes in external memory in ascending order;
In the embodiment of the application, the L-type suffixes are sorted into their L-type suffix bucket blocks with a min-heap HP1 keyed on the suffix's first character and sequence number: the suffix at the top of the heap is read repeatedly and written to the corresponding L-type suffix bucket block. The specific steps are: (1) initializing HP1 to empty, reading the current smallest L-type suffix bucket block into an in-memory array Y whose elements are pairs <chr, pos>, where chr and pos are the first character and the position of a suffix, and stably sorting Y in ascending order; (2) traversing the suffix buckets of Y and of the S-type suffixes in ascending order; within one suffix bucket, traversing the S-type suffixes, the array Y and HP1 in that order, and for the currently traversed suffix e = <chr, pos> doing the following: determining whether the predecessor of e is L-type; if so, reading the preceding character chr0 from the preceding block corresponding to e, and if the predecessor belongs to the current suffix bucket, pushing the predecessor tuple e0 = <chr0, pos-1, id++> onto HP1, otherwise writing e0 to the corresponding L-type suffix bucket; if the predecessor of e is S-type, appending e to the tail of the L-type suffix sequence LStar; (3) repeating the above steps on the next L-type suffix bucket block until all L-type suffix bucket blocks have been traversed.
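The induction principle behind this step can be illustrated with the classic in-memory variant, which is a substitute for, not a rendering of, the external-memory procedure: it replaces the heap HP1 and the preceding blocks with bucket head pointers over a single array. In this Python sketch the function name is hypothetical, indices are 0-based, and `sorted_seeds` is assumed to hold the already sorted S-type suffixes that follow an L-type character.

```python
def induce_l_suffixes(t, sorted_seeds):
    """Return the L-type suffixes of t in ascending order, induced from
    the sorted seed suffixes by one left-to-right bucket scan."""
    n = len(t)
    is_s = [True] * n
    for i in range(n - 2, -1, -1):
        is_s[i] = t[i] < t[i + 1] or (t[i] == t[i + 1] and is_s[i + 1])
    head, tail, total = {}, {}, 0
    for c in sorted(set(t)):                  # bucket boundaries per first character
        head[c] = total
        total += t.count(c)
        tail[c] = total - 1
    sa = [-1] * n
    for j in reversed(sorted_seeds):          # seeds go to the right end of their buckets
        sa[tail[t[j]]] = j
        tail[t[j]] -= 1
    result = []
    for i in range(n):                        # scan the buckets from left to right
        j = sa[i]
        if j < 0:
            continue
        if not is_s[j]:
            result.append(j)                  # an L-type suffix in final order
        if j > 0 and not is_s[j - 1]:         # the preceding suffix is L-type: induce it
            sa[head[t[j - 1]]] = j - 1
            head[t[j - 1]] += 1
    return result
```

For "hhaappyyaap$" with seeds [11, 8, 2] the result is [1, 0, 10, 7, 6], i.e. the L-type suffixes at 1-based positions 2, 1, 11, 8 and 7.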
S406, traversing the L-type suffix sequence LStar and the S-type preceding blocks in descending order, sorting the S-type suffixes into the corresponding S-type suffix bucket blocks, and computing the fingerprint value fp3 of the descending S-type suffixes;
In the embodiment of the application, the S-type suffixes are sorted into their S-type suffix bucket blocks with a max-heap HP2 keyed on the suffix's first character and sequence number: the suffix at the top of the heap is read repeatedly and written to the corresponding S-type suffix bucket block. The specific steps are: (1) initializing HP2 to empty, reading the current largest S-type suffix bucket block into an in-memory array Y whose elements are pairs <chr, pos>, and stably sorting Y in descending order; (2) traversing the suffix buckets of the array Y and of the L-type suffix sequence LStar in descending order; within one suffix bucket, traversing the array Y, LStar and HP2 in that order, and for the currently traversed suffix e = <chr, pos> doing the following: determining whether the predecessor of e is S-type; if so, reading the preceding character chr0 from the preceding block corresponding to e, and if the predecessor belongs to the current suffix bucket, pushing the predecessor tuple e0 = <chr0, pos-1, n--> onto HP2, otherwise writing e0 to the corresponding S-type suffix bucket; if e is of the L type, iteratively computing the fingerprint value fp3 of the descending S-type suffixes with the Karp-Rabin fingerprint function, formulas (1-1) to (1-3); (3) repeating the above steps on the next S-type suffix bucket block until all S-type suffix bucket blocks have been traversed.
S407, fetching each suffix from the L-type and S-type suffix bucket blocks in lexicographic order, computing the link information of each suffix, and generating the final linked list, while computing the fingerprint value fp4 of the ascending S-type suffixes;
In the embodiment of the application, the final linked list is generated as follows: (1) fetching the suffixes one by one from the L-type and S-type suffix bucket blocks in lexicographic order by multi-way merge sorting; (2) computing the pointer field of each suffix from the order of two adjacent suffixes, and placing the suffix in the suffix linked list block to which it belongs according to its position; (3) if the current suffix is S-type, iteratively computing the fingerprint value fp4 of the ascending S-type suffixes with the Karp-Rabin fingerprint function, formulas (1-1) to (1-3); (4) sorting the suffixes within each suffix linked list block by position, and merging the suffix linked list blocks from left to right to generate the final linked list.
In the embodiment of the application, the pointer field of a suffix and its suffix linked list block are computed as follows: suppose each suffix linked list block has length K, and e1, e2 and e3 are three suffixes fetched in succession. When e2 is fetched, the pointer field of e1 is set to e2.pos, the pair <e1.pos, e2.pos> is built for e1 and saved to suffix linked list block number e1.pos/K, where e1.pos and e2.pos are the positions of e1 and e2. When e3 is fetched, the pointer field and suffix linked list block of e2 are computed in the same way, and so on until the last suffix is fetched, whose pointer field is set to 0, i.e., it points to the list head.
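The pointer-field computation can be sketched in Python as follows. Blocks are kept in an in-memory dictionary rather than in external memory, the list head is treated as position 0 so that it lands in block 0 (an assumption of this sketch), and the function name `link_into_blocks` is hypothetical.

```python
def link_into_blocks(ordered_positions, k):
    """ordered_positions: 1-based suffix positions in ascending suffix order.
    Builds (position, pointer) pairs, files each under block position // k,
    then sorts every block by position and concatenates the blocks."""
    blocks = {}
    prev = 0                                          # position 0 stands for the list head
    for pos in ordered_positions:
        blocks.setdefault(prev // k, []).append((prev, pos))
        prev = pos
    blocks.setdefault(prev // k, []).append((prev, 0))  # last suffix points to the head
    psi = []
    for b in sorted(blocks):                          # merge blocks from left to right
        psi.extend(ptr for _, ptr in sorted(blocks[b]))
    return psi
```

Because every position from 0 to n receives exactly one pair, concatenating the position-sorted blocks yields the array ψ(T) directly.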
S408, comparing the ascending and descending fingerprint values of the S-type suffixes obtained in steps S403, S406 and S407, and verifying the correctness of the final linked list.
In the embodiment of the present application, the correctness of the final linked list is verified as follows: if the ascending fingerprint value fp1 from step S403 is the same as the ascending fingerprint value fp4 from step S407, and the descending fingerprint value fp2 from step S403 is the same as the descending fingerprint value fp3 from step S406, the final linked list is correct; otherwise it is wrong.
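The comparison can be illustrated with a small Python sketch. Formulas (1-1) to (1-3) are not reproduced here, so the sketch substitutes a generic Karp-Rabin style polynomial fingerprint over the sequence of suffix positions; the modulus `P`, the base `DELTA` and all function names are assumptions made for illustration only.

```python
P = (1 << 61) - 1   # assumed prime modulus (a Mersenne prime)
DELTA = 1_000_003   # assumed base of the polynomial fingerprint


def update_fp(fp, pos):
    """One iterative step: fold the next suffix position into fp."""
    return (fp * DELTA + pos) % P


def fingerprint(positions):
    """Fingerprint of a whole traversal, computed one suffix at a time."""
    fp = 0
    for pos in positions:
        fp = update_fp(fp, pos)
    return fp


def verify(fp1, fp2, fp3, fp4):
    """Ascending values (fp1, fp4) and descending values (fp2, fp3) must agree."""
    return fp1 == fp4 and fp2 == fp3


s_sorted = [7, 4, 2]                    # S-type suffixes in ascending order
fp1 = fingerprint(s_sorted)             # ascending pass of the first sort
fp2 = fingerprint(reversed(s_sorted))   # descending pass of the first sort
fp3 = fingerprint([2, 4, 7])            # descending pass seen later
fp4 = fingerprint([7, 4, 2])            # ascending pass seen during linking
ok = verify(fp1, fp2, fp3, fp4)
```

Because a fingerprint is a hash of the traversal, equal values indicate matching traversals only with high probability, while unequal values prove that the two traversals differ.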
Referring to fig. 5, a schematic structural diagram of a device for constructing a large data linked list with a verification function according to an embodiment of the present application is shown, which may specifically include the following modules:
The character string preprocessing module 501 is configured to read the character string T from the external memory of the electronic device from right to left, find the S-type characters by comparing the lexicographic order of adjacent characters, and divide the character string T into a plurality of blocks at S-type characters according to the size of the device memory; it also counts the size of each suffix bucket of the character string T, and divides the suffix buckets into a plurality of suffix bucket blocks according to the size of the device memory;
The S-type suffix sorting module 502 is configured to sort the S-type suffixes of the character string T by recursively applying block induced sorting, to traverse the S-type suffixes in ascending and descending order, and to iteratively calculate the ascending and descending fingerprint values of the S-type suffixes using the Karp-Rabin fingerprint function;
The suffix preceding-character calculation module 503 is configured to read each block of the character string T into memory in turn, and to calculate the L-type and S-type predecessor blocks corresponding to each block of the character string T by in-memory induced sorting;
The L-type suffix sorting module 504 calculates the L-type suffix bucket blocks of the character string T from the S-type suffixes and the L-type predecessor blocks by in-memory block induced sorting, and writes the L-type suffixes out to the external memory;
The S-type suffix sorting module 505 calculates the S-type suffix bucket blocks of the character string T from the L-type suffixes and the S-type predecessor blocks by in-memory block induced sorting, and at the same time iteratively calculates the fingerprint value of the descending S-type suffixes using the Karp-Rabin fingerprint function;
The suffix linking module 506 uses external-memory multi-way merge sorting to take each suffix in turn from the L-type and S-type suffix bucket blocks in lexicographic order, calculates the link information of each suffix, places each suffix into the suffix linked list block to which it belongs, and sorts and merges the linked list blocks to generate the final linked list, while iteratively calculating the fingerprint value of the ascending S-type suffixes using the Karp-Rabin fingerprint function;
The verification module 507 compares whether the ascending and descending fingerprint values of the S-type suffixes produced by modules 502, 505 and 506 are the same, and gives the verification result of the final linked list.
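The preprocessing performed by module 501 can be sketched in Python. The sketch uses the standard induced-sorting definition of suffix types (a suffix is L-type if it is lexicographically larger than its right neighbour, otherwise S-type) and a simple greedy grouping of consecutive buckets; the function names and the `capacity` parameter standing in for the device memory size are hypothetical.

```python
from collections import Counter


def classify_suffixes(t):
    """Return the type ('S' or 'L') of every suffix, scanning right to left."""
    n = len(t)
    types = ["S"] * n  # the last suffix (the sentinel) is taken as S-type
    for i in range(n - 2, -1, -1):
        if t[i] > t[i + 1] or (t[i] == t[i + 1] and types[i + 1] == "L"):
            types[i] = "L"
    return types


def bucket_sizes(t):
    """Count how many suffixes start with each character."""
    return Counter(t)


def group_buckets(sizes, capacity):
    """Greedily pack consecutive buckets into blocks of at most `capacity`.

    A single bucket larger than `capacity` still forms its own block.
    """
    blocks, current, used = [], [], 0
    for ch in sorted(sizes):
        if current and used + sizes[ch] > capacity:
            blocks.append(current)
            current, used = [], 0
        current.append(ch)
        used += sizes[ch]
    if current:
        blocks.append(current)
    return blocks


text = "banana$"
types = "".join(classify_suffixes(text))
blocks = group_buckets(bucket_sizes(text), capacity=4)
```

For "banana$" the suffix types come out as LSLSLLS, and with a capacity of four suffixes the buckets are grouped into the blocks {$, a} and {b, n}.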
Referring to fig. 6, the embodiment of the application also discloses an electronic device, which comprises a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for constructing a large data linked list with a verification function according to the foregoing embodiments.
The electronic device 600 may be the electronic device in the foregoing embodiments, and may be a computing device such as a desktop computer or a cloud server. The electronic device 600 may include, but is not limited to, a processor 610 and a memory 620. It will be appreciated by those skilled in the art that fig. 6 is merely an example of the electronic device 600 and does not limit it; the electronic device 600 may include more or fewer components than shown, may combine certain components, or may use different components. For example, the electronic device 600 may also include input and output devices, network access devices, buses, and the like.
The processor 610 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 620 may be an internal storage unit of the electronic device 600, such as a memory of the electronic device 600. The memory 620 may also be a removable external storage device of the electronic device 600, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card) or the like provided on the electronic device 600. Further, the memory 620 may include both internal storage units and external storage devices of the electronic device 600. The memory 620 is used to store the computer program 621 and other programs and data required by the electronic device 600. The memory 620 may also be used to temporarily or permanently store data that has been output or is to be output.
The terms describing positional relationships in the drawings are merely illustrative and are not to be construed as limiting the present patent.
It is to be understood that the above embodiments of the present invention are provided by way of illustration only and do not limit the embodiments of the present invention. Other variations or modifications based on the above description will be apparent to those of ordinary skill in the art. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (7)

1. A method for constructing a large data linked list with a verification function, characterized by comprising the following steps:
(1) Reading a character string T from the external memory from right to left, finding the positions where S-type characters appear by comparing the lexicographic order of adjacent characters, and logically dividing the character string T into a plurality of blocks, with S-type characters as separators, according to the size of the device memory;
(2) Scanning the character string T, counting the number of elements in each suffix bucket, and dividing the suffix buckets into different suffix bucket blocks according to the size of the device memory, so that each block can be induced-sorted in memory; a suffix bucket block consists of consecutive suffix buckets and is stored as a data block on the external memory; a suffix bucket consists of the suffixes in the suffix array that have the same head character;
(3) Calculating the order of the S-type suffixes of the character string T by external-memory block induced sorting, and iteratively calculating the ascending and descending fingerprint values fp1 and fp2 of the S-type suffixes using a fingerprint function;
(4) According to the sorted S-type suffixes, calculating the preceding characters of the L-type and S-type suffixes in each block of the character string T, and storing them in order in the L-type and S-type predecessor blocks corresponding to each block of the character string T; the L-type and S-type predecessor blocks consist of the preceding characters of the L-type and S-type suffixes respectively, and are stored as data blocks on the external memory;
(5) Traversing the S-type suffixes and the L-type predecessor blocks, using a min-heap with the suffix head character and a sequence number as the sort key, sorting the L-type suffixes into the suffix bucket blocks to which they belong, and storing the L-type suffixes in order on the external memory;
(6) Traversing the L-type suffixes and the S-type predecessor blocks in descending order, using a max-heap with the suffix head character and a sequence number as the sort key, sorting the S-type suffixes into the suffix bucket blocks to which they belong, and iteratively calculating the fingerprint value fp3 of the descending S-type suffixes using a fingerprint function;
(7) Taking each suffix in turn from the L-type and S-type suffix bucket blocks in lexicographic order by multi-way merge sorting, calculating the pointer field of each suffix and the suffix linked list block to which it belongs from the position information of adjacent suffixes, sorting the suffix linked list blocks by position, and merging the suffix linked list blocks to generate the final linked list, while iteratively calculating the fingerprint value fp4 of the ascending S-type suffixes using a fingerprint function;
(8) Comparing the ascending and descending fingerprint values of the S-type suffixes, and giving the correctness verification result of the finally generated suffix linked list.
2. The method of claim 1, wherein sorting the L-type suffixes into the suffix bucket blocks to which they belong uses a min-heap HP1, with the suffix head character and a sequence number as the sort key, the suffix at the top of the heap being read repeatedly and written to its L-type suffix bucket block, with the following steps: (1) initializing HP1 to be empty, reading the current smallest L-type suffix bucket block into a memory array Y whose elements are tuples <chr, pos>, where chr and pos denote the head character and the position of a suffix respectively, and sorting Y in stable ascending order; (2) traversing each suffix bucket of the array Y and the S-type suffixes in ascending order; within the same suffix bucket, traversing the S-type suffixes, the array Y and HP1 in turn, and for the currently traversed suffix e = <chr, pos>, if its predecessor is of the L type, reading the preceding character chr0 from the corresponding predecessor block; if the predecessor belongs to the current suffix bucket, pushing the predecessor tuple e0 = <chr0, pos-1, id++> onto HP1, otherwise writing e0 to its L-type suffix bucket; if the predecessor of e is of the S type, appending e to the tail of the L-type suffix sequence LStar; (3) repeating the above steps on the next L-type suffix bucket block until all L-type bucket blocks have been traversed.
3. The method of claim 1, wherein sorting the S-type suffixes into the suffix bucket blocks to which they belong uses a max-heap HP2, with the suffix head character and a sequence number as the sort key, the suffix at the top of the heap being read repeatedly and written to its S-type suffix bucket block, with the following steps: (1) initializing HP2 to be empty, reading the current largest S-type suffix bucket block into a memory array Y whose elements are tuples <chr, pos>, and sorting Y in stable descending order; (2) traversing each suffix bucket of the array Y and the L-type suffix sequence LStar in descending order; within the same suffix bucket, traversing the array Y, LStar and HP2 in turn, and for the currently traversed suffix e = <chr, pos>, if its predecessor is of the S type, reading the preceding character chr0 from the corresponding predecessor block; if the predecessor belongs to the current suffix bucket, pushing the predecessor tuple e0 = <chr0, pos-1, id--> onto HP2, otherwise writing e0 to its S-type suffix bucket; if the predecessor of e is of the L type, iteratively calculating the fingerprint value fp3 of the descending S-type suffixes using the Karp-Rabin fingerprint function; (3) repeating the above steps on the next S-type suffix bucket block until all S-type bucket blocks have been traversed.
4. The method of claim 1, wherein the pointer field of a suffix and its suffix linked list block are calculated as follows: assuming that each suffix linked list block has length K, and that e1, e2 and e3 are three suffixes taken out in turn, when e2 is taken out, the pointer field of e1 is set to e2.pos, the tuple <e1.pos, e2.pos> of e1 is constructed and saved to the (e1.pos/K)-th suffix linked list block, where e1.pos and e2.pos are the positions of e1 and e2 respectively; when e3 is taken out, the pointer field and suffix linked list block of e2 are calculated; and so on until the last suffix is taken out, whose pointer field is set to 0 so that it points to the head of the linked list.
5. The method of claim 1, wherein merging the suffix linked list blocks to generate the final linked list comprises first sorting the suffixes in each suffix linked list block by position, and then merging the suffix linked list blocks in turn from left to right to obtain the final linked list.
6. A device for constructing a large data linked list with a verification function, applied to the method for constructing a large data linked list as claimed in claim 1, characterized by comprising: a character string preprocessing module, used for dividing the character string T into a plurality of blocks and dividing the suffix buckets into a plurality of suffix bucket blocks; an S-type suffix sorting module, used for sorting the S-type suffixes of the character string T and calculating the ascending and descending fingerprint values of the S-type suffixes; a suffix preceding-character calculation module, used for calculating the preceding characters of the L-type and S-type suffixes of each block of the character string T and storing them in the corresponding predecessor blocks on the external memory; an L-type suffix sorting module, which calculates the order of the L-type suffixes from the order of the S-type suffixes and stores the L-type suffixes on the external memory; an S-type suffix sorting module, which calculates the order of the S-type suffixes from the order of the L-type suffixes and calculates the fingerprint value of the descending S-type suffixes; a suffix linking module, which links the suffixes in the L-type and S-type suffix bucket blocks to generate the final suffix linked list and calculates the fingerprint value of the ascending S-type suffixes; and a verification module, used for verifying the correctness of the suffix linked list by comparing the ascending and descending fingerprint values of the S-type suffixes.
7. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for constructing a large data linked list with a verification function as claimed in claim 1.
CN202410036277.3A 2024-01-10 2024-01-10 Construction method and construction device of large data linked list with verification function Pending CN117971826A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410036277.3A CN117971826A (en) 2024-01-10 2024-01-10 Construction method and construction device of large data linked list with verification function

Publications (1)

Publication Number Publication Date
CN117971826A true CN117971826A (en) 2024-05-03

Family

ID=90850432

Country Status (1)

Country Link
CN (1) CN117971826A (en)

Legal Events

Date Code Title Description
PB01 Publication