CN108664459B - Suffix array self-adaptive merging method and device thereof

Info

Publication number: CN108664459B
Application number: CN201810243160.7A
Authority: CN
Inventors: 郇宜东; 徐文涛; 解静仪; 农革
Original assignee: Foshan Shunde Sun Yat-Sen University Research Institute; Sun Yat Sen University; SYSU CMU Shunde International Joint Research Institute
Current assignee: Sun Yat Sen University
Priority date: 2018-03-22
Filing date: 2018-03-22
Publication date: 2021-09-17
Anticipated expiration: 2038-03-22
Also published as: CN108664459A

Abstract

The invention discloses a suffix array adaptive merging method and a device thereof. A suffix array adaptive merging model is built by training a decision tree. Based on the size and type of the source data of the suffix arrays to be merged, the currently available memory, and the size and type of the suffix arrays, the decision tree automatically selects the merging method with the lowest time cost on a computer of a specified configuration. This speeds up merging in index systems that use the suffix array as their underlying index structure and avoids the drag that a single, inefficient conventional merging method places on full-text index maintenance.

Description

Suffix array self-adaptive merging method and device thereof

Technical Field

The present invention relates to the field of data processing, and more particularly, to a suffix array adaptive merging method and apparatus.

Background

The suffix array is a very good substitute for the suffix tree: it supports many of the suffix tree's functions with nearly the same time complexity and, more importantly, occupies far less space. Since it was proposed, the suffix array has been widely applied in string matching, full-text indexing, biological information retrieval, and similar fields.

When suffix arrays are applied to full-text indexing, merging the suffix arrays of multiple files is extremely important. A suffix array is a sorted ordering of the suffixes of a string, and merging can be done by three methods:

1. a merge sort method, whose time complexity is O(n log n) and whose external-sorting form requires multiple disk I/O operations;

2. a method that merges the source data and rebuilds the suffix array, which benefits from the good running efficiency of linear-time, in-memory construction algorithms;

3. a direct merging algorithm, which derives the result from back to front and whose merging efficiency depends on the data type.

A suffix array built from a given piece of data may take different forms, and the time consumed by each merging method differs across suffix array types. At present, full-text indexing relies on a single merging method for suffix arrays; merging efficiency is low, and the method cannot adapt to the variety of data types that full-text indexes involve, which makes data maintenance difficult.
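As a minimal illustration of the merge sort method listed above, the sketch below merges two already-sorted suffix arrays with a two-way merge. The function names `build_suffix_array` and `merge_suffix_arrays` and the `(document, offset)` output convention are assumptions made for this example and do not come from the patent; the naive construction is only suitable for tiny inputs.

```python
def build_suffix_array(text: str) -> list[int]:
    """Naive construction: sort all start offsets by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def merge_suffix_arrays(t1: str, sa1: list[int],
                        t2: str, sa2: list[int]) -> list[tuple[int, int]]:
    """Two-way merge of two sorted suffix arrays.

    Returns (document_id, offset) pairs in lexicographic suffix order.
    On equal suffixes the entry from document 0 comes first.
    """
    merged = []
    i = j = 0
    while i < len(sa1) and j < len(sa2):
        if t1[sa1[i]:] <= t2[sa2[j]:]:
            merged.append((0, sa1[i]))
            i += 1
        else:
            merged.append((1, sa2[j]))
            j += 1
    # One input is exhausted; the rest of the other is already in order.
    merged.extend((0, p) for p in sa1[i:])
    merged.extend((1, p) for p in sa2[j:])
    return merged


t1, t2 = "banana", "ananas"
sa1, sa2 = build_suffix_array(t1), build_suffix_array(t2)
merged = merge_suffix_arrays(t1, sa1, t2, sa2)
```

Each comparison here copies whole suffixes, so this sketch says nothing about the relative cost of the three methods on real data; that cost is what the invention measures and learns from.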

Disclosure of Invention

To solve the above problems, the present invention provides a suffix array adaptive merging method that uses a decision tree to select the data merging method automatically, thereby improving the efficiency of suffix array merging.

The technical scheme adopted by the invention for solving the problems is as follows:

A suffix array adaptive merging method comprises the following steps:

acquiring the size of available memory of the current system;

acquiring the size and type of the source data of the two suffix arrays to be merged;

acquiring the size and type of the two suffix arrays to be merged;

and inputting the available memory size of the current system, the source data size and type of the two suffix arrays to be merged, and the size and type of the two suffix arrays to be merged as parameters into a trained adaptive merging model, wherein the adaptive merging model merges the two suffix arrays to be merged using the merging method with the shortest merging time.

Further, a decision tree is trained to serve as the adaptive merging model, and the training method for the decision tree comprises the following steps:

measuring the time taken to merge training data of the same or different data types, using suffix arrays of the same or different types, under different amounts of currently available memory, and taking the merging method with the shortest time as the selected merging method;

and dividing the training results into two groups, one used as a training set and the other as a verification set, wherein the decision tree is trained on the training set and then verified and pruned with the verification set.
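The text does not name a pruning strategy. The sketch below assumes reduced-error pruning, one common way to prune a decision tree against a held-out set: a subtree is replaced by a leaf whenever doing so does not increase the error on the verification samples. The dictionary-based tree layout and the keys `feature`, `children`, `majority`, `label`, and `best_method` are hypothetical names chosen for this example.

```python
def predict(node: dict, sample: dict) -> str:
    """Walk the tree to a leaf; unseen feature values fall back to the majority label."""
    while "label" not in node:
        child = node["children"].get(sample[node["feature"]])
        if child is None:
            return node["majority"]
        node = child
    return node["label"]


def errors(node: dict, samples: list[dict]) -> int:
    """Number of samples whose recorded fastest method differs from the prediction."""
    return sum(predict(node, s) != s["best_method"] for s in samples)


def prune(node: dict, samples: list[dict]) -> dict:
    """Reduced-error pruning, bottom-up, using the verification samples that reach this node."""
    if "label" in node:
        return node
    feature = node["feature"]
    children = {
        value: prune(child, [s for s in samples if s[feature] == value])
        for value, child in node["children"].items()
    }
    pruned = {"feature": feature, "children": children, "majority": node["majority"]}
    leaf = {"label": node["majority"]}
    # Collapse the subtree when a single leaf is at least as accurate.
    return leaf if errors(leaf, samples) <= errors(pruned, samples) else pruned
```

In this layout each internal node stores the majority label seen during training, so collapsing a subtree needs no access to the training set.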

Further, the merging methods applied to the training data comprise a merge sort method, a method that merges the source data and rebuilds the suffix array, and an algorithm that merges suffix arrays directly.

Furthermore, before the training data are merged, the source data of each type are split and integrated into data of different orders of magnitude, so as to meet the decision tree's requirements on input parameters.

Further, when selecting a merging method, the adaptive merging model uses a current available memory coefficient as one of its decision parameters. The coefficient depends on the available memory size of the current system, the source data sizes of the two suffix arrays to be merged, and the sizes of the two suffix arrays to be merged, and the calculation formula is as follows:

wherein a is the current available memory coefficient, M is the available physical memory of the current system, Data1 and Data2 are the sizes of the source data of the two suffix arrays to be merged, and SA1 and SA2 are the sizes of the two suffix arrays to be merged.

Further, after the adaptive merging model calculates the current available memory coefficient, the current available memory coefficient is discretized.
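Discretization here means mapping the continuous coefficient onto a small number of interval labels that a decision tree can branch on. The sketch below shows one way to do this with fixed boundaries. The boundary values and the name `discretize` are hypothetical, since the text does not give the actual interval ranges.

```python
from bisect import bisect_right

# Hypothetical interval boundaries; the patent text does not state the real ones.
BOUNDARIES = [0.25, 0.5, 1.0, 2.0]


def discretize(coefficient: float, boundaries: list[float] = BOUNDARIES) -> int:
    """Return the index of the half-open interval [lower, upper) containing the coefficient."""
    if coefficient < 0:
        raise ValueError("the available memory coefficient cannot be negative")
    return bisect_right(boundaries, coefficient)
```

With these assumed boundaries the coefficient falls into one of five buckets: [0, 0.25), [0.25, 0.5), [0.5, 1.0), [1.0, 2.0), and [2.0, infinity).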

A suffix array adaptive merging apparatus, comprising:

a memory size acquisition module for acquiring the size of available memory of the current system;

a data reading module for acquiring the size and type of the source data of the two suffix arrays to be merged;

a suffix distinguishing module for acquiring the size and type of the two suffix arrays to be merged;

and a merging module for obtaining the parameters from the memory size acquisition module, the data reading module, and the suffix distinguishing module and inputting them into the trained adaptive merging model, wherein the adaptive merging model merges the two suffix arrays to be merged using the merging method with the shortest merging time.

A storage device storing a plurality of instructions adapted to be loaded by a processor to perform:

acquiring the size of available memory of the current system;

acquiring the size and type of the two suffix arrays to be merged;

The beneficial effects of the invention are as follows: a suffix array adaptive merging model is built by training a decision tree. Based on the size and type of the source data of the suffix arrays to be merged, the currently available memory, and the size and type of the suffix arrays, the decision tree automatically selects the merging method with the lowest time cost on a computer of a specified configuration. This speeds up merging in index systems that use the suffix array as their underlying index structure and avoids the drag that a single, inefficient conventional merging method places on full-text index maintenance.

Drawings

The invention is further illustrated by the following figures and examples.

FIG. 1 is a flow chart of a merging method according to a first embodiment of the present invention;

FIG. 2 is a flowchart illustrating a method for training a decision tree according to a first embodiment of the present invention;

FIG. 3 is a schematic diagram of a decision tree determination process according to a third embodiment of the present invention;

FIG. 4 is an example of two data contents to be merged according to a third embodiment of the present invention;

FIG. 5 is a schematic diagram of the module connections according to a second embodiment of the present invention.

Detailed Description

Referring to FIG. 1 and FIG. 2, a first embodiment provides a suffix array adaptive merging method in which a decision tree is trained to serve as the adaptive merging model. Training comprises the following steps:

A1. collecting data of various types (including gene data, binary data, text data, video and voice data, and the like), and splitting and integrating the data of each type into data sets of different orders of magnitude;

A2. creating suffix arrays of different types for each data set from step A1, and recording the types and sizes of the suffix arrays;

A3. setting the current available memory coefficient, wherein the calculation formula is as follows:

wherein a is the current available memory coefficient, M is the available physical memory of the current system, Data1 and Data2 are the sizes of the source data of the two suffix arrays to be merged, and SA1 and SA2 are the sizes of the two suffix arrays to be merged;

A4. discretizing the current available memory coefficient of step A3;

A5. merging all the suffix arrays of step A2 pairwise with three merging algorithms, under current available memory coefficients in different interval ranges, and taking the merging method with the shortest time as the selected method; the merging algorithms comprise a merge sort method, a method that merges the source data and rebuilds the suffix array, and a direct merging algorithm;

A6. dividing the training results obtained in the step A5 into two groups, wherein one group is a training set, and the other group is a verification set;

A7. training the decision tree on the training set, then verifying and pruning it with the verification set, to obtain the suffix array adaptive merging model suited to the current machine configuration.
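Steps A5 and A6 can be sketched as follows, assuming the merge timings have already been measured. The helper names, the `best_method` key, the method labels, and the 30% verification share are assumptions for illustration; the text only says that the results are divided into two groups.

```python
import random


def label_fastest(timings: dict[str, float]) -> str:
    """Step A5: pick the merging method with the shortest measured time."""
    return min(timings, key=timings.get)


def build_records(measurements: list[tuple[dict, dict]]) -> list[dict]:
    """Attach the fastest method as the target label to each feature dict."""
    return [
        {**features, "best_method": label_fastest(timings)}
        for features, timings in measurements
    ]


def split_records(records: list[dict], verification_fraction: float = 0.3,
                  seed: int = 0) -> tuple[list[dict], list[dict]]:
    """Step A6: shuffle reproducibly, then split into training and verification sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_verify = int(len(shuffled) * verification_fraction)
    return shuffled[n_verify:], shuffled[:n_verify]
```

A fixed seed keeps the split reproducible, which makes it easier to compare trees trained on different machine configurations.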

It should be noted that the direct merging algorithm is not restricted to any one fixed algorithm; every algorithm that merges suffix arrays directly falls within its scope. Likewise, further merging algorithms may be added in step A5, which is not limited to three merging algorithms.

After the decision tree has been trained, the adaptive merging model inherits all the characteristics of the decision tree, and two suffix arrays to be merged are merged by the following steps:

B1. acquiring the size and type of the two suffix arrays to be merged;

B2. acquiring the size and type of the source data of the two suffix arrays to be merged;

B3. acquiring the available physical memory of the current computer;

B4. inputting the data of steps B1 to B3 into the adaptive merging model as parameters, whereupon the adaptive merging model calculates the current available memory coefficient, discretizes it, and determines from these data the merging method with the shortest merging time;

B5. merging the two suffix arrays to be merged by the method obtained in step B4.
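The merge-time steps above amount to a lookup followed by a dispatch. The sketch below shows that control flow only. `adaptive_merge`, the feature keys, the method labels, and the stub model and stub mergers are all hypothetical stand-ins: the real model is the trained decision tree, and the real mergers are the three merging algorithms.

```python
from typing import Callable, Dict, Tuple


def adaptive_merge(sa1, sa2, features: dict,
                   choose_method: Callable[[dict], str],
                   mergers: Dict[str, Callable]) -> Tuple[str, object]:
    """Ask the model for a method label, then run the matching merger."""
    method = choose_method(features)
    if method not in mergers:
        raise KeyError(f"model chose unknown merging method {method!r}")
    return method, mergers[method](sa1, sa2)


# Parameters gathered in steps B1 to B3 (sizes in GB; key names are illustrative).
features = {
    "sa_sizes": (15, 4), "sa_types": ("uncompressed", "uncompressed"),
    "source_sizes": (3, 1), "source_types": ("text", "binary"),
    "available_memory": 2.5,
}


def stub_model(f: dict) -> str:
    """Placeholder rule standing in for the trained decision tree."""
    return "rebuild" if "text" in f["source_types"] else "direct"


# Placeholder mergers that only record which method was called.
stub_mergers = {
    name: (lambda a, b, name=name: (name, len(a) + len(b)))
    for name in ("merge_sort", "rebuild", "direct")
}

method, result = adaptive_merge([0, 1], [2], features, stub_model, stub_mergers)
```

Keeping the model and the mergers as injected callables mirrors the remark that further merging algorithms may be added without changing the selection step.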

Referring to FIG. 5, a second embodiment provides a suffix array adaptive merging device based on the merging method of the first embodiment, comprising the following modules:

A model training module, used for training the decision tree.

The model training module measures the time taken to merge training data of the same or different data types, using suffix arrays of the same or different types, under different amounts of currently available memory, and takes the merging method with the shortest time as the selected merging method. The training results are divided into two groups, one used as a training set and the other as a verification set; the decision tree is trained on the training set, then verified and pruned with the verification set, and the trained decision tree is passed to the merging module to serve as the adaptive merging model.

The merging methods that the model training module applies to the training data comprise a merge sort method, a method that merges the source data and rebuilds the suffix array, and an algorithm that merges suffix arrays directly.

When selecting a merging method, the adaptive merging model uses a current available memory coefficient as one of its decision parameters. The coefficient depends on the available memory size of the current system, the source data sizes of the two suffix arrays to be merged, and the sizes of the two suffix arrays to be merged, and the calculation formula is as follows:

The merging module calculates the current available memory coefficient and then discretizes it.

Also provided is a storage device storing a plurality of instructions adapted to be loaded by a processor to perform:

acquiring the size of available memory of the current system;

acquiring the size and type of the two suffix arrays to be merged;

and inputting the parameters into a trained adaptive merging model, wherein the adaptive merging model merges the two suffix arrays to be merged using the merging method with the shortest merging time.

The storage device further comprises instructions for training the decision tree: measuring the time taken to merge training data of the same or different data types, using suffix arrays of the same or different types, under different amounts of currently available memory, and taking the merging method with the shortest time as the selected merging method; and dividing the training results into two groups, one used as a training set and the other as a verification set, wherein the decision tree is trained on the training set, then verified and pruned with the verification set, and the trained decision tree is passed to the merging module to serve as the adaptive merging model.

In the instructions for training the decision tree, the merging methods applied to the training data comprise a merge sort method, a method that merges the source data and rebuilds the suffix array, and an algorithm that merges suffix arrays directly.

After the current available memory coefficient is calculated, it is discretized.

Referring to FIG. 3 and FIG. 4, a third embodiment follows the method of the first embodiment. After the decision tree has been trained and the adaptive merging model, which covers merging methods A, B and C, has been generated, the system receives two suffix arrays to be merged. A specific workflow is as follows.

The two suffix arrays to be merged are SA1 and SA2, and their source data are s1 and s2 respectively, where s1 is larger than s2.

The source data s1 is of the text data type and is 3 GB in size. Its suffix array uses five bytes to store the address of each single-byte unit of data in s1, so the suffix array SA1 is 15 GB in size. The source data s2 is of the binary data type and is 1 GB in size. Its suffix array uses four bytes to store the address of each single-byte unit of data in s2, so the suffix array SA2 is 4 GB in size. Neither suffix array is in compressed form.
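The suffix array sizes above follow from multiplying the number of single-byte units in the source data by the bytes used per address. A short check of that arithmetic, in which the helper name `suffix_array_size` and the binary definition of GB are assumptions for this example:

```python
GB = 1024 ** 3  # assumed binary gigabyte; the ratio is the same with 10**9


def suffix_array_size(source_bytes: int, bytes_per_entry: int) -> int:
    """An uncompressed suffix array holds one address per single-byte unit of source data."""
    return source_bytes * bytes_per_entry


sa1_bytes = suffix_array_size(3 * GB, 5)  # text data s1, five-byte addresses
sa2_bytes = suffix_array_size(1 * GB, 4)  # binary data s2, four-byte addresses
print(sa1_bytes // GB, sa2_bytes // GB)   # prints: 15 4
```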

The currently available physical memory M is 2.5 GB.

From the above data, the current available memory coefficient can be calculated as

The types of the source data s1 and s2 and the current available memory coefficient are passed into the adaptive merging model, which determines that the suffix array merging method suited to these two source data is method B.

In the three embodiments, the decision tree is trained on three or more merging methods. The resulting decision tree determines the merging method from the discretized current available memory coefficient and the source data type, and the two suffix arrays to be merged are merged accordingly.

The above description is only a preferred embodiment of the present invention, and the present invention is not limited to the above embodiments. Any solution that achieves the technical effects of the present invention by the same means shall fall within the protection scope of the present invention.

Claims

1. A suffix array adaptive merging method is characterized by comprising the following steps:

acquiring the size of available memory of a current system;

acquiring the size and type of the two suffix arrays to be merged;

2. The suffix array adaptive merging method according to claim 1, wherein a decision tree is trained to serve as the adaptive merging model, and the training of the decision tree comprises the following steps:

3. The suffix array adaptive merging method of claim 2, wherein: the merging methods applied to the training data comprise a merge sort method, a method that merges the source data and rebuilds the suffix array, and an algorithm that merges suffix arrays directly.

4. The suffix array adaptive merging method of claim 2, wherein: before the training data are merged, the source data of each type are split and integrated into data of different orders of magnitude, so as to meet the decision tree's requirements on input parameters.

5. The suffix array adaptive merging method of claim 1, wherein: when selecting a merging method, the adaptive merging model uses a current available memory coefficient as one of its decision parameters; the coefficient depends on the available memory size of the current system, the source data sizes of the two suffix arrays to be merged, and the sizes of the two suffix arrays to be merged, and the calculation formula is as follows:

6. The suffix array adaptive merging method of claim 5, wherein: after the self-adaptive merging model calculates the current available memory coefficient, the current available memory coefficient is discretized.

7. A suffix array adaptive merging apparatus, comprising

8. The suffix array adaptive merging apparatus of claim 7, further comprising a model training module for training the decision tree, wherein: the model training module measures the time taken to merge training data of the same or different data types, using suffix arrays of the same or different types, under different amounts of currently available memory, and takes the merging method with the shortest time as the selected merging method; and the training results are divided into two groups, one used as a training set and the other as a verification set, the decision tree being trained on the training set, then verified and pruned with the verification set, and the trained decision tree being passed to the merging module to serve as the adaptive merging model.

9. A storage device storing a plurality of instructions adapted to be loaded by a processor to perform:

acquiring the size of available memory of the current system;

acquiring the size and type of the two suffix arrays to be merged;

and inputting the available memory size of the current system, the source data size and type of the two suffix arrays to be merged, and the size and type of the two suffix arrays to be merged as parameters into a trained adaptive merging model, wherein the adaptive merging model merges the two suffix arrays to be merged using the merging method with the shortest merging time.