CN117290674B - Method and system for counting and positioning repeated codes of large-data-volume random bit sequence

Info

Publication number: CN117290674B
Application number: CN202311566659.9A
Authority: CN
Inventors: 赵嘉程; 周琛
Original assignee: Zhejiang Quantum Technologies Co ltd
Current assignee: Zhejiang Quantum Technologies Co ltd
Priority date: 2023-11-23
Filing date: 2023-11-23
Publication date: 2024-04-05
Anticipated expiration: 2043-11-23
Also published as: CN117290674A

Abstract

A method and system for counting and positioning repeated codes in a large-data-volume random bit sequence. The method comprises: obtaining a random bit sequence to be tested; initializing a test system; sample extraction and divide-and-conquer storage; searching for repeated codes in the sample storage sets; and accurate positioning and length expansion of the repeated codes. The system comprises a data input module, a test system initialization module, a sample extraction and divide-and-conquer storage module, a sample storage set duplicate checking module, and a repeated code accurate positioning and length expansion module. Compared with the prior art, the method ensures that repeated codes are directed to the same sample storage set while keeping each single sample storage set workable in a general production environment; it reduces the computational cost of repeated code positioning, that is, it improves the time and space efficiency of repeated code statistics and positioning; it effectively improves the efficiency of duplicate checking on large-data-volume samples; and it ensures that all repeated code elements are detected, improving both the efficiency and the quality of duplicate checking.

Description

Method and system for counting and positioning repeated codes of large-data-volume random bit sequence

Technical Field

The invention relates to the technical field of big data, in particular to a method and a system for counting and positioning the repeated codes of a random bit sequence with large data volume.

Background

With the continuous innovation and development of quantum information technology, related industries are becoming increasingly commercialized. Random number generators that use quantum random processes as entropy sources offer significant improvements in both the output rate and the quality of random sequences. In terms of output rate, some domestic manufacturers can achieve a network port output rate of 600 Mbps, which supports a considerable range of application environments. As for quality, a random number sequence is generally required to pass the test items defined in the GM/T 0005-2021 randomness test specification. Statistics on the repeated sequences present in a random bit sequence can, to some extent, describe the quality of the original random number sequence. For a large-data-volume original random bit sequence, such as 10 GB, the length of the repeated codes to be counted is usually specified to range from 64 bits to 80 bits. Considering only the 64-bit repeated codes contained in a 10 GB random bit sequence, the number of samples is 85,899,345,857. Most computers currently cannot analyze and process such a data volume directly. For duplicate checking on large data volumes, a Bitmap is typically used to map the sample set, and the repeated code results are then collected from the repeated mappings that occur during the mapping process. However, this method cannot cope with the present problem: the value range of a 64-bit sample is 0 to 2^64 - 1, so mapping the sample space with a Bitmap would require 2^64 bits, that is 2,147,483,648 GB of space, which is obviously impossible to realize. The Bloom Filter algorithm, which is based on the Bitmap, uses multiple hashes to reduce the mapping space, but it is subject to false matches, so it cannot deliver exact results, which is unacceptable here.
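The two figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes that "GB" means 2^30 bytes and that samples are overlapping 64-bit windows taken at every bit position; the function names are illustrative only.

```python
GIB = 2**30  # assumption: the text's "GB" means 2**30 bytes


def window_count(total_bytes: int, window_bits: int = 64) -> int:
    """Number of overlapping window_bits-wide samples in a bit sequence."""
    total_bits = total_bytes * 8
    return max(total_bits - window_bits + 1, 0)


def bitmap_size_gib(window_bits: int = 64) -> int:
    """Size in GB of a bitmap holding one bit per possible sample value."""
    bitmap_bytes = 2**window_bits // 8
    return bitmap_bytes // GIB


print(window_count(10 * GIB))  # 85899345857 samples in 10 GB
print(bitmap_size_gib())       # 2147483648 GB for a full 64-bit bitmap
```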

Disclosure of Invention

The invention aims to provide a method and a system for counting and positioning repeated codes in a large-data-volume random bit sequence, so as to search for, count, and position repeated codes efficiently and accurately in a large data volume.

The technical scheme of the invention is realized as follows:

a method for counting and positioning the repeated codes of a large data volume random bit sequence comprises the following steps:

obtaining a random bit sequence to be detected;

initializing a test system, determining a test scale, setting processing parameters, acquiring and constructing an initial sample of a random bit sequence, and adjusting an initial sample byte sequence;

sample extraction and divide-and-conquer storage, extracting all samples contained in a random bit sequence, and storing the samples to different sample storage sets according to preset divide-and-conquer storage conditions;

searching for the repeated codes in the sample storage set, acquiring all samples in the sample storage set, detecting the repeated codes, and outputting the repeated code samples to the repeated code pair set;

accurate positioning and length expansion of the repeated codes: traversing the repeated code pair set, calculating the accurate position of each repeated code element in the random number sequence, obtaining the complete length of the repeated code, and eliminating repeatedly counted repeated codes according to position information;

thereby completing the repeated code statistics and positioning of the large-data-volume random number sequence.

Preferably, during initialization of the test system, the initial sample is constructed as a sample data structure containing a sample sequence and sample position information, the sample sequence being filled with a 64-bit random bit sub-sequence.

Preferably, the sample sequence is adjusted to a proper storage order according to the host byte sequence, and the sample position information is composed of a sector number and a segment number, and the division of the sector number and the segment number is determined according to the random bit sequence data amount actually detected.

Preferably, in the process of sample extraction and divide-and-conquer storage, each sample in the random bit sequence is extracted by updating the previous sample through basic bit operations combined with the bit at the current processing position, and the sample position information is adjusted accordingly; the preset divide-and-conquer storage condition is set according to the random bit sequence, the extracted samples are uniformly distributed into the specified sample storage sets according to the preset condition, and each sample storage set can be read directly into memory for processing.

Preferably, in the process of searching for repeated codes in a sample storage set, repeated code elements are screened out by the comparison operations performed during quicksort; repeated code pairs, each comprising the bit sequences of the two repeated code elements and their position information, are constructed and output to a preset repeated code pair set.

Preferably, in the process of accurate positioning and length expansion of the repeated codes, the accurate position of the repeated code element in the random bit sequence is calculated, and the matching sequence is directly positioned in a segment interval corresponding to the random bit sequence based on sample position information constructed in the data extraction process.

The invention also provides a system for counting and positioning the repeated codes of the random bit sequences with large data volume, which comprises:

the data input module is used for acquiring a random bit sequence from a quantum random number generator or from the system, for subsequent detection;

the test system initialization module is used for determining, from the system parameter input and the system presets, operating parameters such as the size of a single sample structure, the total number of sample storage sets, and the initial sample;

the sample extraction and divide-and-conquer storage module is used for extracting all samples from the random bit sequence and uniformly storing the samples into the sample storage sets according to the divide-and-conquer function;

the sample storage set duplicate checking module is used for recording, within each sample storage set, the repeated code elements it contains by means of the quicksort process, and for constructing repeated code pairs;

and the repeated code accurate positioning and length expanding module is used for accurately positioning the found repeated codes, screening and removing the repeated codes contained by the longer repeated codes, and constructing a repeated code result linked list.

Compared with the prior art, the invention has the following beneficial effects:

The invention provides a method and a system for counting and positioning repeated codes in a large-data-volume random bit sequence. By adjusting the byte order of the initial sample, sample extraction is accomplished using bit operations only. At the same time, the large volume of samples is uniformly mapped to different sample storage sets by a divide-and-conquer function (remainder), which not only ensures that repeated codes are necessarily directed to the same sample storage set, but also keeps each single sample storage set workable in a general production environment. The designed sample structure reduces the storage space occupied by samples to a certain extent, reduces the computational cost of repeated code positioning, and improves the time and space efficiency of repeated code statistics and positioning.

A mechanism for running multiple duplicate checking processes in parallel in real time: the independence of the sample storage sets makes it possible to run duplicate checking in parallel on a high-performance system; several duplicate checking processes run in the system simultaneously, each independently checking one sample storage set, which effectively improves the efficiency of duplicate checking on large-data-volume samples.

Duplicate checking based on the quicksort process: quicksort makes duplicate checking efficient, and the comparison operations performed during quicksort make it exact, ensuring that all repeated code elements are detected and improving both the efficiency and the quality of duplicate checking.

Drawings

Fig. 1 is a flow chart of a method for counting and positioning a repetition code according to a first embodiment of the present invention;

Fig. 2 is a flow chart of a system for counting and positioning a repetition code according to a second embodiment of the present invention;

Fig. 3 is a schematic flow chart of a sample extraction and divide-and-conquer storage module according to a second embodiment of the invention;

Fig. 4 is a schematic diagram of repetition conditions in a repetition code accurate positioning and length expansion module according to a second embodiment of the present invention;

Fig. 5 is a schematic diagram of a parallel processing framework of a duplication checking process of a duplication code statistics and positioning system according to a third embodiment of the present invention.

Detailed Description

The present invention will now be described more fully hereinafter with reference to the accompanying drawings, in which embodiments of the invention are shown.

In a first embodiment, as shown in fig. 1, the embodiment of the present invention discloses a method for counting and positioning a large data volume random bit sequence repetition code, the method includes the following steps:

S1: Acquire the random bit sequence to be detected. The random bit sequence is acquired from a quantum random number generator or from the system for subsequent detection.

S2: Initialize the test system. The total number of logically divided sectors and the total number of segments are determined according to the total data amount N. At the same time, the number C of sample storage sets is determined according to the total data amount N of the random bit sequence to be detected, the size of a single sample structure, and the size S of a single sample storage set, and the C sample storage sets are then created and opened. The first 64 bits (8 bytes) of the random bit sequence are obtained and their byte order is adjusted to form the initial sample.

S3: Sample extraction and divide-and-conquer storage. Starting from the 9th byte of the random bit sequence, the content is read byte by byte, and 8 loop iterations are executed for each byte read. In each iteration, the bit sequence of the previous sample is shifted left as a whole by one bit, and the bit of the current byte corresponding to the iteration count is placed in the lowest bit of the sample bit sequence. This yields a new sample, whose position information is determined by the current processing position. In the divide-and-conquer storage process, each time a sample is obtained, the integer represented by its bit sequence is taken modulo the number C of sample storage sets to obtain the sample storage set sequence number Index, and the current sample is stored in the sample storage set corresponding to Index. If repeated codes exist, their calculated Index values are necessarily the same, so they are assigned to the same sample storage set.

S4: Duplicate checking of the sample storage sets. The sample storage sets are read into memory one by one, and the total number of samples contained in each is determined. All samples in a sample storage set are sorted with the quicksort algorithm, the sort key being the value of the sample bit sequence. Repeated code pairs constructed from equal samples are output to the repeated code pair set. After the duplicate checking of all sample storage sets is finished, the complete repeated code pair set is obtained.

S5: Accurate positioning and length expansion of the repeated codes. A repeated code result linked list is constructed, and each pair of repeated code samples in the repeated code pair set is traversed. The corresponding segment in the original random bit sequence is found from the position information carried by each repeated code sample, and the segment is scanned bit by bit. Once the repeated code bit sequence is matched, the intra-segment bit offset is recorded, so that the repeated code can be positioned exactly from the sector number, the segment number, and the intra-segment offset of the sample. After both repeated codes of a pair have been positioned exactly, their length is expanded to obtain the maximum length of the repeated code. If the repeated code does not overlap any element of the repeated code result linked list, it is appended to the list; if it overlaps an element, that element is updated to the longer repeated code.

Thus, the repeated code statistics and positioning of the random bit sequence with large data volume are completed.

Specifically, during initialization of the test system, when the random bit sequence is logically divided into sectors and segments, the position information should occupy as little space as possible while still being able to represent the sample position. For example, for a 10 GB random number sequence, the position information can be recorded in two bytes. The sector number is represented by the upper 4 bits of the first byte; 4 bits can represent 16 sectors, and for convenience each 1 GB of the original random bit sequence is treated as one sector, giving 10 sectors in total. The segment number is represented by the remaining 12 bits, which can represent 4096 values, so each 1 GB sector is divided into 4096 segments of 262144 bytes, that is, 262144 x 8 samples per segment.
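The two-byte position layout from this example can be sketched as follows. This is an illustration only: it assumes 1 GB means 2^30 bytes, that a sample's position is derived from the byte offset at which it starts, and that the two bytes are stored most significant byte first. The function names are hypothetical.

```python
SECTOR_BITS = 4                                # upper 4 bits: sector number
SEGMENT_BITS = 12                              # lower 12 bits: segment number
SECTOR_BYTES = 2**30                           # assumption: 1 GB = 2**30 bytes
SEGMENT_BYTES = SECTOR_BYTES >> SEGMENT_BITS   # 262144 bytes per segment


def pack_position(byte_offset: int) -> bytes:
    """Encode the sector and segment of a byte offset into two bytes."""
    sector, within_sector = divmod(byte_offset, SECTOR_BYTES)
    if not 0 <= sector < (1 << SECTOR_BITS):
        raise ValueError("sector number does not fit in 4 bits")
    segment = within_sector // SEGMENT_BYTES
    return ((sector << SEGMENT_BITS) | segment).to_bytes(2, "big")


def unpack_position(packed: bytes) -> tuple[int, int]:
    """Decode two position bytes back into (sector, segment)."""
    value = int.from_bytes(packed, "big")
    return value >> SEGMENT_BITS, value & ((1 << SEGMENT_BITS) - 1)


# A sample starting in sector 3, segment 5:
offset = 3 * SECTOR_BYTES + 5 * SEGMENT_BYTES + 17
print(unpack_position(pack_position(offset)))  # (3, 5)
```

With this layout a sample record would take 8 bytes for the bit sequence plus 2 bytes for the position.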

Specifically, during initialization of the test system, the first 64 bits of the random bit sequence are acquired and then processed to serve as the bit sequence of the initial sample. Because host byte order differs between systems, shifting the directly loaded 64-bit value left as a whole may give a different result from logically moving the 64-bit sample window one bit further along the random number sequence. On a little-endian system, for example, the byte order of the original 64 bits must be reversed.
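A small sketch of why the byte order matters, assuming the first bit of the stream is the most significant bit of its first byte. Reading the first 8 bytes as a big-endian integer makes a left shift equivalent to sliding the window one bit along the stream; a little-endian load of the same bytes does not have this property. The helper name is illustrative.

```python
MASK64 = (1 << 64) - 1


def initial_sample(data: bytes) -> int:
    """Build the initial 64-bit sample so that bit 0 of the stream is the MSB."""
    if len(data) < 8:
        raise ValueError("need at least 8 bytes")
    return int.from_bytes(data[:8], "big")


data = bytes([0x81, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0xF9])

# Window starting at stream bit 1, computed directly from the 72-bit stream.
expected = (int.from_bytes(data, "big") >> 7) & MASK64

# Same window obtained by shifting the initial sample and appending bit 64.
shifted = ((initial_sample(data) << 1) | (data[8] >> 7)) & MASK64
print(shifted == expected)  # True

# A little-endian load of the same bytes gives a different starting value.
wrong = int.from_bytes(data[:8], "little")
print(((wrong << 1) | (data[8] >> 7)) & MASK64 == expected)  # False
```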

Specifically, in sample extraction and divide-and-conquer storage, each sample is derived from the previous sample. Only one byte is processed at a time, so byte order need not be considered and only the bit-level processing has to be completed. The divide-and-conquer storage may use file storage, MySQL, and so on; a suitable scheme is selected according to the actual application scenario. The number C of sample storage sets means that the data are uniformly distributed over C files or C data tables, and the calculated sample storage set sequence number Index denotes the file number or the database table number.
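The extraction loop and the remainder-based distribution can be sketched as below. For brevity the sketch makes two simplifying assumptions that differ from the text: the position is kept as the starting bit index rather than as a sector and segment number, and in-memory lists stand in for the C files or database tables.

```python
from collections import defaultdict

MASK64 = (1 << 64) - 1


def extract_samples(data: bytes):
    """Yield (sample, start_bit) for every 64-bit window of the stream."""
    if len(data) < 8:
        return
    sample = int.from_bytes(data[:8], "big")   # initial sample
    start_bit = 0
    yield sample, start_bit
    for byte in data[8:]:                      # from the 9th byte onward
        for k in range(7, -1, -1):             # 8 iterations, MSB first
            sample = ((sample << 1) | ((byte >> k) & 1)) & MASK64
            start_bit += 1
            yield sample, start_bit


def partition(data: bytes, set_count: int) -> dict:
    """Distribute samples over set_count storage sets by remainder."""
    sets = defaultdict(list)
    for sample, start_bit in extract_samples(data):
        index = sample % set_count             # equal samples share an index
        sets[index].append((sample, start_bit))
    return sets


sets = partition(bytes(range(16)), set_count=4)
print(sum(len(s) for s in sets.values()))      # 65 windows in 128 bits
```

Because Index depends only on the sample value, two equal samples always land in the same set, which is what allows each set to be checked independently.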

Specifically, in duplicate checking of a sample storage set, the single set is quicksorted and repeated codes are matched during the sorting process. Each repeated code pair constructed after a match comprises the repeated code bit sequence, the position information of repeated code A, and the position information of repeated code B.

Specifically, in accurate positioning and length expansion of the repeated codes, the offset matching of a single repeated code within its segment is similar to the sample extraction process, except that after each sample is extracted it is only checked for a match. For the two repeated code elements of a repeated code pair, once both exact positions have been found, the two are compared bit by bit beyond the matched sample until the bits differ, which gives the repeated code length.
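A minimal sketch of the two operations described here: locating a sample inside one segment by sliding a 64-bit window, and then extending the match forward bit by bit. Positions are expressed as bit indices from the start of the stream, and the function names and the segment parameters are assumptions made for illustration.

```python
MASK64 = (1 << 64) - 1


def get_bit(data: bytes, i: int) -> int:
    """Bit i of the stream, counting from the MSB of the first byte."""
    return (data[i >> 3] >> (7 - (i & 7))) & 1


def locate_in_segment(data: bytes, sample: int, seg_start_byte: int,
                      seg_len_bytes: int) -> int:
    """Return the bit position of the first window that starts inside the
    segment and equals sample, or -1 if there is none."""
    first = seg_start_byte * 8
    stop = min((seg_start_byte + seg_len_bytes) * 8, len(data) * 8 - 63)
    if first >= stop:
        return -1
    window = 0
    for i in range(first, first + 64):
        window = (window << 1) | get_bit(data, i)
    pos = first
    while window != sample:
        pos += 1
        if pos >= stop:
            return -1
        window = ((window << 1) | get_bit(data, pos + 63)) & MASK64
    return pos


def extend_length(data: bytes, pos_a: int, pos_b: int, length: int = 64) -> int:
    """Grow a match of the given length forward until the bits differ."""
    total_bits = len(data) * 8
    while (max(pos_a, pos_b) + length < total_bits
           and get_bit(data, pos_a + length) == get_bit(data, pos_b + length)):
        length += 1
    return length


block = bytes([0xA5, 0x3C, 0x7E, 0x11, 0x99, 0x42, 0xF0, 0x0D, 0x6B])
stream = block + b"\x00" + block + b"\xff"     # a 72-bit repeat at bits 0 and 80
print(extend_length(stream, 0, 80))            # 72
```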

Specifically, in accurate positioning and length expansion of the repeated codes, the repeated code result linked list is initially empty. Each linked list element should contain the complete repeated code bit sequence, the repeated code length, the exact position of repeated code A, and the exact position of repeated code B. After the length expansion of a repeated code is completed, the existing result linked list is searched, using the length and exact position information, to determine whether the new repeated code is a subsequence of an existing element or an existing element is a subsequence of the new repeated code. In either of these two cases, the element in the linked list is updated so that it holds the longer of the two.
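The containment check can be sketched as follows. The sketch uses a Python list in place of a linked list, omits the stored bit sequence, and assumes that positions are bit indices with the A position always the smaller of the two; the `Repeat` type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Repeat:
    pos_a: int    # exact bit position of repeated code A
    pos_b: int    # exact bit position of repeated code B
    length: int   # repeated code length in bits


def contains(outer: Repeat, inner: Repeat) -> bool:
    """True if inner is the same repeat as outer, restricted to a sub-range."""
    shift = inner.pos_a - outer.pos_a
    return (shift >= 0
            and inner.pos_b - outer.pos_b == shift
            and shift + inner.length <= outer.length)


def add_repeat(results: list, new: Repeat) -> None:
    """Insert new, keeping only the longest repeat of each overlapping group."""
    if any(contains(old, new) for old in results):
        return                                  # already covered by a longer one
    results[:] = [old for old in results if not contains(new, old)]
    results.append(new)


# A 66-bit repeat reported three times, as windows of length 64, 65 and 66:
results = []
for item in (Repeat(2, 102, 64), Repeat(1, 101, 65), Repeat(0, 100, 66)):
    add_repeat(results, item)
print(results)  # [Repeat(pos_a=0, pos_b=100, length=66)]
```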

In a second embodiment, as shown in fig. 2, the embodiment of the present invention discloses a system for calculating and locating a large data volume random bit sequence repetition code, which includes the following components:

The data input module 201 is configured to obtain a random bit sequence from a quantum random number generator or from the system for subsequent detection.

The test system initialization module 202 is configured to determine, from the system parameter input and the system presets, operating parameters such as the size of a single sample structure, the total number of sample storage sets, and the initial sample.

The sample extraction and divide-and-conquer storage module 203 is configured to extract all samples from the random bit sequence and uniformly store them in the sample storage sets according to the divide-and-conquer function.

The sample storage set duplicate checking module 204 is configured to record, within each sample storage set, the repeated code elements it contains by means of the quicksort process, and to construct repeated code pairs.

The repeated code accurate positioning and length expansion module 205 is configured to position the found repeated codes exactly, to screen out and remove repeated codes that are contained in longer repeated codes, and to construct the repeated code result linked list.

Specifically, the system input is the total data amount of the random bit sequence, and the system presets are the total number of logically divided sectors, the total number of segments within each sector, and the actual size of the random bit sequence covered by each sector. The preset parameters should be chosen appropriately for the total data amount of the random bit sequence under test, and the combined size of the sector number and the segment number should be a whole number of bytes. The size of a single sample structure is the size of the sample sequence plus the size of the position information, which consists of the sector number and the segment number. The total storage space required for the samples can be calculated from the single sample size and the total number of samples, and the total number of sample storage sets can then be calculated from the total storage space and the size of a single sample storage set. Finally, the first 8 bytes of the random bit sequence are read and reordered according to the host byte order to generate the initial sample.
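The calculation of the number of sample storage sets can be written out as below. The sketch assumes a 2-byte position field, rounds the result up to a whole number of sets, and uses a 1 GB set size purely as an example value; none of these choices is fixed by the text.

```python
def storage_set_count(total_bytes: int, set_bytes: int,
                      sample_bits: int = 64, position_bytes: int = 2) -> int:
    """Number C of sample storage sets needed for a stream of total_bytes."""
    sample_struct_bytes = sample_bits // 8 + position_bytes   # 8 + 2 = 10
    total_samples = total_bytes * 8 - sample_bits + 1
    total_storage = total_samples * sample_struct_bytes
    return -(-total_storage // set_bytes)                     # ceiling division


GIB = 2**30
# 10 GB of input with storage sets of 1 GB each (example value only):
print(storage_set_count(10 * GIB, 1 * GIB))  # 800
```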

Specifically, the specific flow of the sample extraction and divide-and-conquer storage module 203 is shown in fig. 3:

S2031: entering the sample extraction and divide-and-conquer storage module;

S2032: acquiring the byte to be processed currently from the random bit sequence, starting from the 9th byte of the random bit sequence;

S2033: determining whether all samples in the current byte have been extracted; if not, continuing extraction in S2034; if extraction in the current byte is complete, proceeding to S2035;

S2034: generating a new sample by shifting the bit sequence of the previous sample left by one bit as a whole and then filling the lowest bit with the corresponding bit of the currently processed byte; after the new sample is obtained, taking the integer represented by its bit sequence modulo the total number of sample storage sets to obtain the sample storage set sequence number Index, and storing the new sample into the sample storage set corresponding to Index;

S2035: determining whether all samples in the random bit sequence have been extracted; if not, returning to S2032 to continue extracting samples from the next byte; otherwise proceeding to S2036;

S2036: ending the sample extraction and divide-and-conquer process and exiting the current module.

Specifically, in the sample storage set duplicate checking module 204, equal elements are captured within the quicksort algorithm flow, which reduces the number of comparisons required for duplicate checking. The time complexity of duplicate checking is reduced to O(n log n). Because this process examines every element in the sample storage set, no repeated code is missed and the resulting repeated code count is exact.
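One way to capture equal elements during quicksort is a three-way partition, in which every element equal to the pivot is collected in the same pass. The sketch below is an assumption about how this could look, not the module's actual code: it is not in place, it does not return the sorted list, and for a value that occurs k times it emits k - 1 pairs anchored on the first occurrence.

```python
import random


def find_repeat_pairs(samples):
    """samples: list of (value, position) tuples.
    Returns a list of (value, position_a, position_b) repeated code pairs."""
    pairs = []
    stack = [list(samples)]
    while stack:
        items = stack.pop()
        if len(items) < 2:
            continue                      # a single element has no duplicate
        pivot = random.choice(items)[0]
        less, equal, greater = [], [], []
        for item in items:                # three-way partition on the value
            if item[0] < pivot:
                less.append(item)
            elif item[0] > pivot:
                greater.append(item)
            else:
                equal.append(item)
        first = equal[0]
        for other in equal[1:]:           # all copies of pivot are in `equal`
            pairs.append((pivot, first[1], other[1]))
        stack.append(less)
        stack.append(greater)
    return pairs


demo = [(5, 0), (3, 1), (5, 2), (9, 3), (3, 4), (5, 5), (7, 6)]
print(sorted(find_repeat_pairs(demo)))    # [(3, 1, 4), (5, 0, 2), (5, 0, 5)]
```

Since partitioning is by value, all copies of a value always travel into the same sub-list, so each duplicated value is eventually caught when it is chosen as a pivot.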

Specifically, in the accurate positioning and length expansion module 205, the accurate positioning process is similar to the sample extraction process in the sample extraction and divide-and-conquer storage module 203, but the search range is narrowed to a single segment according to the position information, which greatly reduces the number of matching operations. After positioning, the repeated code length is obtained simply by comparing the two repeated codes bit by bit for as long as the corresponding bits remain equal.

Specifically, in the accurate positioning and length expansion module 205, the repeated counting of subsequences of a longer repeated code arises when the repeated code is longer than the sample bit sequence, as described below with reference to the figure.

As shown in fig. 4, one member (R0-R65) of a pair of repeated codes of length 66 lies in the random bit sequence 2051. Under the processing of the sample extraction and divide-and-conquer storage module 203, three bit sequences 2052, 2053 and 2054 are extracted from it. These three bit sequences are stored into different sample storage sets according to the divide-and-conquer result, each meets its counterpart from the other member of the pair there, and each is therefore screened out during duplicate checking. The lengths are then expanded: bit sequence 2052 is extended to 2055, and bit sequence 2053 is extended to 2056. Three repeated code sequences, of lengths 64, 65 and 66, are thus obtained from a single repeated code, so subsequences of the longer repeated code are counted repeatedly. This case is detected and handled when the repeated code result linked list is finally constructed.

In the third embodiment, as shown in fig. 5, building on the large-data-volume random bit sequence repeated code statistics and positioning system of the second embodiment, a framework for parallel processing in the duplicate checking module is disclosed, described as follows:

Repeated code statistics and positioning system instance 501: the system instance comprises N duplicate checking process instances, C sample storage sets, and one repeated code pair set.

Sample storage set list 502: the sample storage set list is the output of the system's sample extraction and divide-and-conquer storage module 203. It contains C mutually independent sample storage sets in total. The total number of sets C is determined by the total number of samples and the preset size of a single sample storage set. During parallel processing, the sample storage sets are handed one at a time, in order, to any idle instance in the duplicate checking process instance list 503.

Duplicate checking process instance list 503: the duplicate checking process instance list is the set of all processes that perform duplicate checking in parallel. It comprises N mutually independent duplicate checking processes. The total number of processes N should be determined according to the system running environment. Each duplicate checking process acquires one sample storage set from the sample storage set list 502, performs duplicate checking on it, and outputs the result to the repeated code pair set 504.

Repeated code pair set 504: the repeated code pair set holds the output of the system's sample storage set duplicate checking module 204. It stores the repeated code pair information, each entry consisting of the repeated code sequence, the position information of repeated code A, and the position information of repeated code B. The entries are constructed and submitted by whichever process in the duplicate checking process instance list 503 completes a duplicate checking task.

Specifically, the total number N of duplicate checking process instances in the repeated code statistics and positioning system instance 501 should be determined by the resources of the system running environment, chiefly the number of CPU cores and the available memory. To ensure normal operation of the system, N should be smaller than the number of CPU cores, and the total memory occupied by all processes should be smaller than the total memory available on the server.
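The two constraints on N can be turned into a small helper. This is a sketch under stated assumptions: the per-process memory is estimated as the storage set size times an `overhead_factor`, which is a made-up tuning parameter, and `check_all_sets` is only one possible way to hand sets to idle worker processes using the standard library.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def choose_process_count(set_bytes: int, available_mem_bytes: int,
                         cpu_cores=None, overhead_factor: float = 2.0) -> int:
    """Pick N below the core count and within the available memory."""
    cores = cpu_cores if cpu_cores is not None else (os.cpu_count() or 1)
    by_cpu = max(cores - 1, 1)               # N smaller than the core count
    per_process = int(set_bytes * overhead_factor)
    by_mem = (available_mem_bytes - 1) // per_process   # N * per_process < available
    if by_mem < 1:
        raise MemoryError("one storage set does not fit in available memory")
    return min(by_cpu, by_mem)


def check_all_sets(set_paths, check_one, process_count: int) -> list:
    """Run check_one(path) on every storage set with process_count workers.
    check_one must be a picklable top-level function returning a list of pairs."""
    pairs = []
    with ProcessPoolExecutor(max_workers=process_count) as pool:
        for result in pool.map(check_one, set_paths):
            pairs.extend(result)
    return pairs


GIB = 2**30
print(choose_process_count(1 * GIB, 16 * GIB, cpu_cores=8))  # 7
```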

Taken together, the structure, method, and embodiments of the invention ensure that repeated codes are necessarily directed to the same sample storage set while keeping each single sample storage set workable in a general production environment. The designed sample structure reduces the storage space occupied by samples to a certain extent and reduces the computational cost of repeated code positioning, that is, it improves the time and space efficiency of repeated code statistics and positioning. A mechanism for running multiple duplicate checking processes in parallel in real time: the independence of the sample storage sets makes it possible to run duplicate checking in parallel on a high-performance system; several duplicate checking processes run in the system simultaneously, each independently checking one sample storage set, which effectively improves the efficiency of duplicate checking on large-data-volume samples. Duplicate checking based on the quicksort process: quicksort makes duplicate checking efficient, and the comparison operations performed during quicksort make it exact, ensuring that all repeated code elements are detected and improving both the efficiency and the quality of duplicate checking.

Claims

1. A method for large data volume random bit sequence repetition code statistics and positioning, comprising the following steps:

obtaining a random bit sequence to be detected;

sample extraction and divide-and-conquer storage: extracting all samples contained in the random bit sequence and storing the samples into different sample storage sets according to a preset divide-and-conquer storage condition, wherein, in the sample extraction and divide-and-conquer storage process, each sample in the random bit sequence is extracted by updating the previous sample through basic bit operations combined with the bit at the current processing position, and the sample position information is adjusted; the preset divide-and-conquer storage condition is set according to the random bit sequence, the extracted samples are uniformly distributed into the specified sample storage sets according to the random bit sequence, and each sample storage set can be read directly into memory for processing; each time a sample is obtained, the integer represented by the bit sequence of the sample is taken modulo the number C of sample storage sets to obtain a sample storage set sequence number Index, and the current sample is stored into the sample storage set corresponding to Index; if repeated codes exist, their calculated Index values are the same and they are assigned to the same sample storage set;

accurate positioning and length expansion of the repeated codes: traversing the repeated code pair set, calculating the accurate position of each repeated code element in the random bit sequence, obtaining the complete length of the repeated code, and eliminating repeatedly counted repeated codes according to position information;

thereby completing the repeated code statistics and positioning of the large-data-volume random bit sequence.

2. The method for large data volume random bit sequence repetition statistics and localization of claim 1, wherein during initialization of the test system, the initial samples are constructed as a sample data structure containing sample sequences and sample location information, the sample sequences being filled with 64-bit random bit subsequences.

3. The method for large data volume random bit sequence repetition code statistics and localization as claimed in claim 1, wherein the sample sequence is adjusted to a proper storage order according to the host byte sequence, the sample position information is composed of sector numbers and segment numbers, and the division of the sector numbers and the segment numbers is determined according to the actually detected random bit sequence data volume.

4. The method for counting and positioning the repeated codes of the random bit sequences with large data volume according to claim 1, wherein in the process of searching for repeated codes in the sample storage set, repeated code elements are screened out by the comparison operations performed during quicksort, repeated code pairs comprising the bit sequences of the two repeated code elements and their position information are constructed, and the repeated code pairs are output to a preset repeated code pair set.

5. The method for counting and positioning the repeated codes of the random bit sequence with large data volume according to claim 1, wherein the calculation of the accurate positions of the repeated code elements in the random bit sequence is based on the sample position information constructed in the data extraction process in the accurate positioning and length expansion process of the repeated codes, and the matching sequence is directly positioned in the segment interval corresponding to the random bit sequence.

6. A system for large data volume random bit sequence repetition code statistics and localization, for implementing the method for large data volume random bit sequence repetition code statistics and localization according to any one of claims 1-5,

comprising:

the sample extraction and divide-and-conquer storage module, used for extracting all samples from a random bit sequence and uniformly storing the samples into the sample storage sets according to a divide-and-conquer function, wherein, in the sample extraction and divide-and-conquer storage process, each sample in the random bit sequence is extracted by updating the previous sample through basic bit operations combined with the bit at the current processing position, and the sample position information is adjusted; the preset divide-and-conquer storage condition is set according to the random bit sequence, the extracted samples are uniformly distributed into the specified sample storage sets according to the random bit sequence, and each sample storage set can be read directly into memory for processing; each time a sample is obtained, the integer represented by the bit sequence of the sample is taken modulo the number C of sample storage sets to obtain a sample storage set sequence number Index, and the current sample is stored into the sample storage set corresponding to Index; if repeated codes exist, their calculated Index values are the same and they are assigned to the same sample storage set;