CN112765171A - Optimization algorithm for multi-field combined index access for blockchain data uplink

Info

Publication number: CN112765171A
Application number: CN202110038939.7A
Authority: CN
Inventors: 洪薇; 洪健; 李京昆; 刘文思
Original assignee: Hubei Chenweixi Chain Information Technology Co ltd
Current assignee: Hubei Chenweixi Chain Information Technology Co ltd
Priority date: 2021-01-12
Filing date: 2021-01-12
Publication date: 2021-05-07
Anticipated expiration: 2041-01-12
Also published as: CN112765171B

Abstract

The invention discloses an optimization algorithm for multi-field combined index access for blockchain data uplink, comprising the following steps: obtaining the structure of a data table through an SQL command provided by the database system or through its built-in tools; extracting a combined index; performing a DISTINCT deduplication operation on the values of each field in the combined index to calculate the unique value count of each field; sorting the values of the combined index with ORDER BY; scanning each field for violations of the non-decreasing order to find the positions of the breakpoints; and segmenting the data at the breakpoints and reading the data segment by segment. The invention makes full use of the combined index to improve reading efficiency, serves the two goals of maximizing parallel processing and minimizing the number of read batches, and guarantees that the data ranges do not overlap, which ensures the correctness of the application function.

Description

Optimization algorithm for multi-field combined index access for blockchain data uplink

Technical Field

The present invention relates to the field of blockchains, and more particularly to an optimization algorithm for multi-field combined index access for blockchain data uplink.

Background

Relational database systems are widely used across fields and industries for their structured data management capabilities and standardized SQL interfaces. With the database's data export tool, data can easily be exported into a byte stream file or into a specific file format supported by the database system. The index function of the database system greatly facilitates record-level search and positioning: single records can be located quickly, and records can be selected by a range of values of a single field. Under multi-threaded processing, each thread can then process and analyze, in parallel, an interval whose data range is isolated from the others according to the value of the index field.

However, for a multi-field combined unique index, the question is how to divide the isolated data ranges accurately, while still using the index effectively, so as to achieve either of two goals: maximizing parallel processing or minimizing the number of read batches. The conventional practice is to locate and divide using a single index field or a partial index. This approach is simple to implement, but it has two drawbacks. First, the combined index cannot be fully exploited, which reduces reading efficiency. Second, because the value space of a single field or of a subset of fields does not reflect the value space of all fields of the combined index, the data ranges processed by the threads overlap, that is, data is read and processed repeatedly, which increases processing time and resource overhead. An optimization algorithm for multi-field combined index access for blockchain data uplink is therefore proposed to solve these problems.

Disclosure of Invention

The invention provides a new method for parallelized reading and processing over a combined index. It can serve the two goals of maximizing parallel processing and minimizing the number of read batches, achieves efficient reading and processing, and ensures that the data ranges do not overlap, so that the application functions remain correct.

The technical purpose of the invention is achieved by the following technical solution:

An optimization algorithm for multi-field combined index access for blockchain data uplink, comprising the following steps:

S1, obtaining the structure of a data table through an SQL command provided by the database system or through its built-in tools;

S2, extracting the combined index, and if several combined indexes exist, selecting one of them at random (the combined index serves as the identifier for reading data, and one such identifier is selected for the operation);

S3, performing a DISTINCT deduplication operation on the values of each field in the combined index, and calculating the unique value count of each field of the extracted combined index, expressed as COUNT(DISTINCT());

S4, sorting the values of the combined index, using ORDER BY as the sorting method;

S5, scanning for violations of the non-decreasing order and finding the positions of the breakpoints;

S6, segmenting the data at the breakpoints and reading the data segment by segment, so as to ensure the integrity of the data read.
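Steps S1 to S3 can be sketched against a concrete database. The following is a minimal illustration only, assuming SQLite (through Python's standard `sqlite3` module) as the database system; the table name `records` and the index name `idx_combo` are hypothetical, and the sketch simply takes the first multi-column index it finds rather than choosing one at random.

```python
import sqlite3

# Sample rows (F1, F2, F3) taken from the worked example in this description.
ROWS = [(1, 1, 1), (1, 1, 2), (1, 1, 3), (1, 2, 1), (1, 2, 2),
        (1, 2, 3), (1, 2, 4), (2, 1, 1), (2, 1, 2), (2, 1, 3)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (F1 INTEGER, F2 INTEGER, F3 INTEGER)")
conn.execute("CREATE UNIQUE INDEX idx_combo ON records (F1, F2, F3)")
conn.executemany("INSERT INTO records VALUES (?, ?, ?)", ROWS)


def combined_index_fields(conn, table):
    """S1/S2: return the column names of the first multi-column index."""
    for _seq, name, *_rest in conn.execute(f"PRAGMA index_list({table})"):
        cols = [row[2] for row in conn.execute(f"PRAGMA index_info({name})")]
        if len(cols) > 1:
            return cols
    return []


def distinct_counts(conn, table, fields):
    """S3: COUNT(DISTINCT field) for every field of the combined index."""
    return {
        f: conn.execute(f"SELECT COUNT(DISTINCT {f}) FROM {table}").fetchone()[0]
        for f in fields
    }


fields = combined_index_fields(conn, "records")
counts = distinct_counts(conn, "records", fields)
print(fields, counts)
```

Other database systems expose the table and index structure through different commands or catalog views, so only the overall flow carries over.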

Further, in step S4, step A.2 performs the following operation on each field of the combined index sorted in step A.1:

Scan all values of the field and search for breakpoints, that is, positions where the value breaks the non-decreasing order because it is smaller than the value in the preceding row. Taking the sorting result of step A.1 as an example, the field values marked in blue are the breakpoints of that field.

After all fields have been processed, the results are as follows:

The first sorting field, i.e. F3, has no breakpoint because all of its values satisfy the non-decreasing order; its processing can therefore be skipped in actual operation and in the implementation of the algorithm.
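The breakpoint search of step A.2 can be sketched as follows. This is an illustrative reading, not the patent's own code: the function name `find_breakpoints` is hypothetical, and the input rows are the sample data of the embodiment sorted by (F3, F1, F2).

```python
def find_breakpoints(rows):
    """Map each column index to the row indices (0-based) where the value
    is smaller than in the preceding row, i.e. where non-decreasing order breaks."""
    n_cols = len(rows[0]) if rows else 0
    breaks = {c: [] for c in range(n_cols)}
    for i in range(1, len(rows)):
        for c in range(n_cols):
            if rows[i][c] < rows[i - 1][c]:
                breaks[c].append(i)
    return breaks


# Sample rows in (F3, F1, F2) order, already sorted by F3, F1, F2.
SORTED_A = [(1, 1, 1), (1, 1, 2), (1, 2, 1), (2, 1, 1), (2, 1, 2),
            (2, 2, 1), (3, 1, 1), (3, 1, 2), (3, 2, 1), (4, 1, 2)]

breakpoints = find_breakpoints(SORTED_A)
print(breakpoints)
```

In this sample the first column (F3) yields an empty list, which is consistent with the remark that the first sorting field never has a breakpoint and can be skipped.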

A.3 Perform range splitting using the results of step A.2:

A.3.1 scan the rows of the combined index in sequence until a row containing a blue-marked column is encountered;

A.3.2 mark the row immediately above the row containing the blue-marked column in green, as the lower boundary of the interval;

A.3.3 continue scanning from the row containing the blue-marked column until the next row containing a blue-marked column is encountered;

A.3.4 taking the row found in step A.3.3 as the reference, jump back to step A.3.2 and repeat until all data rows have been scanned, obtaining the following results:

A.4 Each row marked in green is the lower boundary of a range block, and the upper boundary is a white (unmarked) row or the green row itself, i.e. a range block cannot contain more than one green-marked row.

Further, the upper and lower boundaries of the finally formed data block are:

upper and lower boundaries

{[1,1,1],[1,1,2]}

{[1,2,1],[1,2,1]}

{[2,1,1],[2,1,2]}

{[2,2,1],[2,2,1]}

{[3,1,1],[3,1,2]}

{[3,2,1],[3,2,1]}

{[4,1,2],[4,1,2]}

A.5 The result of step A.4 defines the interval ranges for data processing; any access that spans two intervals may result in incomplete or lost data.
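The scan of steps A.3 and A.4 can be expressed compactly in code. The sketch below is one possible reading under stated assumptions: `split_ranges` is a hypothetical name, a row "contains a blue-marked column" when any of its fields is smaller than the same field in the preceding row, and the last row always closes the final block.

```python
def split_ranges(rows):
    """Return [(upper_boundary_row, lower_boundary_row), ...] for sorted rows."""
    if not rows:
        return []
    blocks = []
    start = 0
    for i in range(1, len(rows)):
        # A breakpoint row starts a new block; the row above it is the lower boundary.
        if any(cur < prev for cur, prev in zip(rows[i], rows[i - 1])):
            blocks.append((rows[start], rows[i - 1]))
            start = i
    blocks.append((rows[start], rows[-1]))
    return blocks


# Sample rows in (F3, F1, F2) order, sorted by F3, F1, F2.
SORTED_A = [(1, 1, 1), (1, 1, 2), (1, 2, 1), (2, 1, 1), (2, 1, 2),
            (2, 2, 1), (3, 1, 1), (3, 1, 2), (3, 2, 1), (4, 1, 2)]

blocks_a = split_ranges(SORTED_A)
for upper, lower in blocks_a:
    print(upper, lower)
```

On the sample data this produces seven blocks, matching the boundary list given above.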

Further, B.1: the fields of the combined index are arranged by the unique value counts calculated in step S3, from small to large. Taking the data of step S3 as an example, the order is F1, F2, F3, and the following operation is executed:

ORDER BY F1 ASC, F2 ASC, F3 ASC

The sorting result is as follows:

F1,F2,F3

{1,1,1}

{1,1,2}

{1,1,3}

{1,2,1}

{1,2,2}

{1,2,3}

{1,2,4}

{2,1,1}

{2,1,2}

{2,1,3}

B.2 For each field of the combined index sorted in step B.1, the following operation is performed:

Scan all values of the field and search for breakpoints, that is, positions where the value breaks the non-decreasing order. Taking the sorting result of step B.1 as an example, the field values marked in blue are the breakpoints of that field. After all fields have been processed, the results are as follows:

The first sorting field, i.e. F1, has no breakpoint because all of its values satisfy the non-decreasing order; its processing can therefore be skipped in actual operation and in the implementation of the algorithm.

Further, B.3: perform range splitting using the results of step B.2:

B.3.1 scan the rows of the combined index in sequence until a row containing a blue-marked column is encountered;

B.3.2 mark the row immediately above the row containing the blue-marked column in green, as the lower boundary of the interval;

B.3.3 continue scanning from the row containing the blue-marked column until the next row containing a blue-marked column is encountered;

B.3.4 taking the row found in step B.3.3 as the reference, jump back to step B.3.2 and repeat until all data rows have been scanned, obtaining the following results:

B.4 Each row marked in green is the lower boundary of a range block, and the upper boundary is a white (unmarked) row or the green row itself, i.e. a range block cannot contain more than one green-marked row.

Further, the upper and lower boundaries of the finally formed data block are:

upper and lower boundaries

{[1,1,1],[1,1,3]}

{[1,2,1],[1,2,4]}

{[2,1,1],[2,1,3]}

B.5 The result of step B.4 defines the interval ranges for data processing; any access that spans two intervals may result in incomplete or lost data.
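Reading the data interval by interval (step S6) could look like the sketch below. It is an assumption-laden illustration rather than the patent's implementation: it uses SQLite through Python's `sqlite3` module, a hypothetical table `records` with index `idx_combo`, and SQLite's row-value comparison (available since SQLite 3.15) to express each interval as an inclusive range over (F1, F2, F3). In practice each interval could be handed to a separate worker.

```python
import sqlite3

ROWS = [(1, 1, 1), (1, 1, 2), (1, 1, 3), (1, 2, 1), (1, 2, 2),
        (1, 2, 3), (1, 2, 4), (2, 1, 1), (2, 1, 2), (2, 1, 3)]

# The three (upper boundary, lower boundary) pairs obtained for goal B.
INTERVALS = [((1, 1, 1), (1, 1, 3)),
             ((1, 2, 1), (1, 2, 4)),
             ((2, 1, 1), (2, 1, 3))]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (F1 INTEGER, F2 INTEGER, F3 INTEGER)")
conn.execute("CREATE UNIQUE INDEX idx_combo ON records (F1, F2, F3)")
conn.executemany("INSERT INTO records VALUES (?, ?, ?)", ROWS)

QUERY = """
SELECT F1, F2, F3 FROM records
WHERE (F1, F2, F3) >= (?, ?, ?) AND (F1, F2, F3) <= (?, ?, ?)
ORDER BY F1 ASC, F2 ASC, F3 ASC
"""


def read_interval(conn, upper, lower):
    """Read every row whose key lies between the two boundaries, inclusive."""
    return conn.execute(QUERY, (*upper, *lower)).fetchall()


batches = [read_interval(conn, upper, lower) for upper, lower in INTERVALS]
all_read = [row for batch in batches for row in batch]
print([len(batch) for batch in batches])
```

Because the intervals do not overlap and together cover the whole key space of the sample, every row is read exactly once.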

Further, consider a special case in which no blue-marked column exists, taking the following data as an example:

F1,F2,F3

{1,1,1}

{2,2,2}

{3,3,3}

{3,4,5}

In this case, the rows may be partitioned into anywhere from 1 to N intervals, where N is the number of rows, which here equals the number of values listed for each field.

1 interval:

upper and lower boundaries

{[1,1,1],[3,4,5]}

2 intervals, 3 cases in total:

2.1:

upper and lower boundaries

{[1,1,1],[1,1,1]}

{[2,2,2],[3,4,5]}

2.2:

upper and lower boundaries

{[1,1,1],[2,2,2]}

{[3,3,3],[3,4,5]}

2.3:

upper and lower boundaries

{[1,1,1],[3,3,3]}

{[3,4,5],[3,4,5]}

3 intervals, 3 cases in total:

3.1:

upper and lower boundaries

{[1,1,1],[1,1,1]}

{[2,2,2],[2,2,2]}

{[3,3,3],[3,4,5]}

3.2:

upper and lower boundaries

{[1,1,1],[2,2,2]}

{[3,3,3],[3,3,3]}

{[3,4,5],[3,4,5]}

3.3:

upper and lower boundaries

{[1,1,1],[1,1,1]}

{[2,2,2],[3,3,3]}

{[3,4,5],[3,4,5]}

4 intervals:

upper and lower boundaries

{[1,1,1],[1,1,1]}

{[2,2,2],[2,2,2]}

{[3,3,3],[3,3,3]}

{[3,4,5],[3,4,5]}
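The enumeration for the special case can be reproduced mechanically: with no breakpoints, any choice of cut positions between consecutive rows gives a valid partition. The sketch below is illustrative only; `contiguous_partitions` is a hypothetical helper that lists every way of cutting the sorted rows into k contiguous intervals.

```python
from itertools import combinations


def contiguous_partitions(rows, k):
    """All ways to cut sorted rows into k contiguous intervals,
    each interval given as an (upper_boundary, lower_boundary) pair."""
    n = len(rows)
    result = []
    for cuts in combinations(range(1, n), k - 1):
        edges = (0, *cuts, n)
        result.append([(rows[a], rows[b - 1]) for a, b in zip(edges, edges[1:])])
    return result


ROWS = [(1, 1, 1), (2, 2, 2), (3, 3, 3), (3, 4, 5)]

case_counts = [len(contiguous_partitions(ROWS, k)) for k in range(1, len(ROWS) + 1)]
print(case_counts)
```

For the four sample rows this gives 1, 3, 3 and 1 cases for 1, 2, 3 and 4 intervals respectively, in line with the cases listed above.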

The above formal rules can be extended to combined primary keys/indexes with any number of fields.

In conclusion, the invention has the following beneficial effects:

The invention provides a new method for parallelized reading and processing over a combined index. For a multi-field combined unique index, and on the premise that the index is used effectively, the method makes full use of the combined index to improve reading efficiency, serves the two goals of maximizing parallel processing and minimizing the number of read batches, and ensures that the data ranges do not overlap, so that the application function remains correct.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1

In a preferred embodiment of the present invention, an optimization algorithm for multi-field combined index access for blockchain data uplink comprises the following steps:

In step S4, step A.2 performs the following operation on each field of the combined index sorted in step A.1:

Scan all values of the field and search for breakpoints, that is, positions where the value breaks the non-decreasing order. Taking the sorting result of step A.1 as an example, the field values marked in blue are the breakpoints of that field. After all fields have been processed, the results are as follows:

A.3 Perform range splitting using the results of step A.2:

The upper and lower boundaries of the finally formed data block are:

upper and lower boundaries

{[1,1,1],[1,1,2]}

{[1,2,1],[1,2,1]}

{[2,1,1],[2,1,2]}

{[2,2,1],[2,2,1]}

{[3,1,1],[3,1,2]}

{[3,2,1],[3,2,1]}

{[4,1,2],[4,1,2]}

B.1 The fields of the combined index are arranged by the unique value counts calculated in step S3, from small to large. Taking the data of step S3 as an example, the order is F1, F2, F3, and the following operation is executed:

ORDER BY F1 ASC, F2 ASC, F3 ASC

The sorting result is as follows:

F1,F2,F3

{1,1,1}

{1,1,2}

{1,1,3}

{1,2,1}

{1,2,2}

{1,2,3}

{1,2,4}

{2,1,1}

{2,1,2}

{2,1,3}

B.3 Perform range splitting using the results of step B.2:

The upper and lower boundaries of the finally formed data block are:

upper and lower boundaries

{[1,1,1],[1,1,3]}

{[1,2,1],[1,2,4]}

{[2,1,1],[2,1,3]}

Example 2

In a preferred embodiment of the present invention, an optimization algorithm for multi-field combined index access for blockchain data uplink considers a special case in which no blue-marked column exists, taking the following data as an example:

F1,F2,F3

{1,1,1}

{2,2,2}

{3,3,3}

{3,4,5}

1 interval:

upper and lower boundaries

{[1,1,1],[3,4,5]}

2 intervals, 3 cases in total:

2.1:

upper and lower boundaries

{[1,1,1],[1,1,1]}

{[2,2,2],[3,4,5]}

2.2:

upper and lower boundaries

{[1,1,1],[2,2,2]}

{[3,3,3],[3,4,5]}

2.3:

upper and lower boundaries

{[1,1,1],[3,3,3]}

{[3,4,5],[3,4,5]}

3 intervals, 3 cases in total:

3.1:

upper and lower boundaries

{[1,1,1],[1,1,1]}

{[2,2,2],[2,2,2]}

{[3,3,3],[3,4,5]}

3.2:

upper and lower boundaries

{[1,1,1],[2,2,2]}

{[3,3,3],[3,3,3]}

{[3,4,5],[3,4,5]}

3.3:

upper and lower boundaries

{[1,1,1],[1,1,1]}

{[2,2,2],[3,3,3]}

{[3,4,5],[3,4,5]}

4 intervals:

upper and lower boundaries

{[1,1,1],[1,1,1]}

{[2,2,2],[2,2,2]}

{[3,3,3],[3,3,3]}

{[3,4,5],[3,4,5]}

DISTINCT: performs a deduplication operation on an array of field values; for example, {1, 1, 2, 3, 3} becomes {1, 2, 3} after the DISTINCT operation;

COUNT: counts the number of elements in the field value array; for example, COUNT({1, 2, 3}) = 3.

Taking the combined index {F1, F2, F3} as an example, the values are as follows:

F1,F2,F3

{1,1,1}

{1,1,2}

{1,1,3}

{1,2,1}

{1,2,2}

{1,2,3}

{1,2,4}

{2,1,1}

{2,1,2}

{2,1,3}

The unique value count of F1 is 2, the unique values being 1 and 2;

the unique value count of F2 is 2, the unique values being 1 and 2;

the unique value count of F3 is 4, the unique values being 1, 2, 3 and 4.
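The DISTINCT and COUNT operations on the sample data can be mirrored in a few lines of Python, where a set plays the role of DISTINCT and `len` the role of COUNT. This is an illustration of the definitions only; the helper name `unique_values` is hypothetical.

```python
ROWS = [(1, 1, 1), (1, 1, 2), (1, 1, 3), (1, 2, 1), (1, 2, 2),
        (1, 2, 3), (1, 2, 4), (2, 1, 1), (2, 1, 2), (2, 1, 3)]
FIELDS = ("F1", "F2", "F3")


def unique_values(rows, fields):
    """DISTINCT per field: the sorted set of values found in each column."""
    return {name: sorted({row[i] for row in rows}) for i, name in enumerate(fields)}


uniq = unique_values(ROWS, FIELDS)
# COUNT applied to the deduplicated values of each field.
unique_counts = {name: len(values) for name, values in uniq.items()}
print(uniq, unique_counts)
```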

S4, according to which of the two optimization goals is pursued:

A, maximizing parallel processing;

B, minimizing the read batches.

A.1 For goal A, the fields of the combined index are arranged by the unique value counts calculated in step S3, from large to small. Taking the data of step S3 as an example, the order is F3, F1, F2, and the following operation is executed:

ORDER BY F3 ASC, F1 ASC, F2 ASC

The sorting result is as follows:

F3,F1,F2

{1,1,1}

{1,1,2}

{1,2,1}

{2,1,1}

{2,1,2}

{2,2,1}

{3,1,1}

{3,1,2}

{3,2,1}

{4,1,2}
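The choice of sort order from the unique value counts can be sketched as follows. The function name `sort_by_cardinality` is hypothetical, and the handling of ties is an assumption: fields with equal counts (F1 and F2 here) keep their original index order, which the text does not specify.

```python
def sort_by_cardinality(rows, fields, descending=True):
    """Order the fields by unique value count, then sort the rows by that order.

    Returns (ordered_field_names, rows_projected_and_sorted).
    """
    counts = [len({row[i] for row in rows}) for i in range(len(fields))]
    # sorted() is stable even with reverse=True, so ties keep the original order.
    order = sorted(range(len(fields)), key=lambda i: counts[i], reverse=descending)
    ordered_fields = [fields[i] for i in order]
    projected = sorted(tuple(row[i] for i in order) for row in rows)
    return ordered_fields, projected


ROWS = [(1, 1, 1), (1, 1, 2), (1, 1, 3), (1, 2, 1), (1, 2, 2),
        (1, 2, 3), (1, 2, 4), (2, 1, 1), (2, 1, 2), (2, 1, 3)]

fields_desc, rows_desc = sort_by_cardinality(ROWS, ("F1", "F2", "F3"), descending=True)
fields_asc, rows_asc = sort_by_cardinality(ROWS, ("F1", "F2", "F3"), descending=False)
print(fields_desc, rows_desc)
```

With `descending=True` the sample yields the order F3, F1, F2 and the sorted rows shown above; with `descending=False` it yields F1, F2, F3, the order used for the goal of minimizing the read batches.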

ORDER BY: performs a sorting operation on the field values;

ASC: used together with ORDER BY to sort in ascending order; for example, F1 = {1, 3, 2, 0} becomes {0, 1, 2, 3} after the ORDER BY F1 ASC operation;

ORDER BY F2 ASC, F1 ASC: the rows are first sorted in ascending order by the value of F2, and rows with the same F2 value are then sorted in ascending order by the value of F1.
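The behaviour of these clauses can be checked with a small experiment. The sketch below assumes SQLite through Python's `sqlite3` module; the table `t` and its rows are made up for the illustration. In a multi-key ORDER BY, the key listed first is applied first and later keys only break ties.

```python
import sqlite3

DATA = [(1, 2), (3, 1), (2, 2), (0, 1)]  # hypothetical (F1, F2) rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (F1 INTEGER, F2 INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", DATA)

# Single key: F1 values in ascending order.
single = [row[0] for row in conn.execute("SELECT F1 FROM t ORDER BY F1 ASC")]

# Two keys: sorted by F2 first; F1 breaks ties among equal F2 values.
two_keys = conn.execute("SELECT F1, F2 FROM t ORDER BY F2 ASC, F1 ASC").fetchall()

# The same ordering expressed with a Python sort key.
python_equiv = sorted(DATA, key=lambda row: (row[1], row[0]))

print(single, two_keys)
```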

In summary, the invention provides a new method for parallelized reading and processing over a combined index. It can serve the two goals of maximizing parallel processing and minimizing the number of read batches, achieves efficient reading and processing, and ensures that the data ranges do not overlap, so that the application functions remain correct.

The foregoing shows and describes the general principles and broad features of the present invention and advantages thereof. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. An optimization algorithm for multi-field combined index access for blockchain data uplink, characterized by comprising the following operation steps:

S2, extracting the combined index, and if several combined indexes exist, selecting one of them at random;

S3, performing a DISTINCT deduplication operation on the values of each field in the combined index to calculate the unique value count of each field of the extracted combined index;

2. The algorithm of claim 1, characterized in that, in step S4, step A.2 performs the following operation on each field of the combined index sorted in step A.1:

3. The algorithm of claim 2, characterized in that, after all fields have been scanned and processed, the results are as follows:

the first sorting field, i.e. F3, has no breakpoint because all of its values satisfy the non-decreasing order, so that its processing can be skipped in actual operation and in the implementation of the algorithm;

A.3 perform range splitting using the results of step A.2:

4. The algorithm of claim 3, characterized in that the upper and lower boundaries of the finally formed data blocks are:

upper and lower boundaries

{[1,1,1],[1,1,2]}

{[1,2,1],[1,2,1]}

{[2,1,1],[2,1,2]}

{[2,2,1],[2,2,1]}

{[3,1,1],[3,1,2]}

{[3,2,1],[3,2,1]}

{[4,1,2],[4,1,2]}

5. The algorithm of claim 4, characterized in that, B.1, the fields of the combined index are arranged by the unique value counts calculated in step S3, from small to large, and the following operation is executed:

ORDER BY F1 ASC, F2 ASC, F3 ASC

The sorting result is as follows:

F1,F2,F3

{1,1,1}

{1,1,2}

{1,1,3}

{1,2,1}

{1,2,2}

{1,2,3}

{1,2,4}

{2,1,1}

{2,1,2}

{2,1,3}.

6. The algorithm of claim 5, characterized in that, B.2, for each field of the combined index sorted in step B.1, the following operation is performed:

scan all values of the field and search for breakpoints, that is, positions where the value breaks the non-decreasing order; taking the sorting result of step B.1 as an example, the field values marked in blue are the breakpoints of that field, and after all fields have been processed, the results are as follows:

7. The algorithm of claim 6, characterized in that, B.3, range splitting is performed using the results of step B.2:

8. The algorithm of claim 7, characterized in that the upper and lower boundaries of the finally formed data blocks are:

upper and lower boundaries

{[1,1,1],[1,1,3]}

{[1,2,1],[1,2,4]}

{[2,1,1],[2,1,3]}