Disclosure of Invention
In order to solve the above problems, the present invention provides a suffix array adaptive merging method that uses a decision tree to automatically determine the merging method, so as to improve the efficiency of suffix array merging.
The technical scheme adopted by the invention for solving the problems is as follows:
a suffix array adaptive merging method comprises the following steps:
acquiring the size of available memory of a current system;
acquiring the size and the type of source data of two suffix arrays to be combined;
acquiring the size and the type of two suffix arrays to be combined;
and inputting the available memory size of the current system, the source data size and type of the two suffix arrays to be merged and the size and type of the two suffix arrays to be merged as parameters into a trained adaptive merging model, wherein the adaptive merging model merges the two suffix arrays to be merged by a merging method consuming the shortest merging time.
Further, a decision tree is trained as the adaptive merging model, and the training method for the decision tree includes the following steps:
calculating the time consumed by merging training data of the same or different data types, using suffix arrays of the same or different types, under different amounts of currently available memory, and taking the merging method with the shortest time consumption as the selected merging method;
and dividing the training results into two groups, wherein one group is used as a training set, the other group is used as a verification set, and the verification set is used for verifying and pruning the decision tree after the decision tree is trained by using the training set.
Further, the merging methods for merging the training data comprise a merge sort method, a method for reconstructing a suffix array after merging the source data, and a merging algorithm for directly merging the suffix arrays.
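As an illustration of the second of these merging methods (reconstructing the suffix array after merging the source data), a minimal sketch is given below. It assumes that merging the source data means simple concatenation and uses a naive sort-based construction; the function names `build_suffix_array` and `merge_by_rebuild` are hypothetical, and a practical system would use a far more efficient construction algorithm.

```python
from __future__ import annotations


def build_suffix_array(data: bytes) -> list[int]:
    """Naive construction: sort all suffix start offsets lexicographically.

    Illustrative only; each comparison copies a suffix, so this does not
    scale to large inputs.
    """
    return sorted(range(len(data)), key=lambda i: data[i:])


def merge_by_rebuild(source1: bytes, source2: bytes) -> tuple[bytes, list[int]]:
    """Merge two sources (assumed here: by concatenation) and rebuild the
    suffix array of the merged data from scratch."""
    merged = source1 + source2
    return merged, build_suffix_array(merged)


if __name__ == "__main__":
    merged, sa = merge_by_rebuild(b"ban", b"ana")
    print(merged, sa)
```

This method ignores the two existing suffix arrays entirely, which is why its cost relative to the other methods depends on the data sizes and the available memory.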
Furthermore, before the training data are combined, the source data of each type are respectively split and integrated into data with different magnitude levels so as to adapt to the requirements of the decision tree on the input parameters.
Further, the adaptive merging model selection merging method uses a current available memory coefficient as one of the judgment parameters, the current available memory coefficient is related to the available memory size of the current system, the source data size of the two suffix arrays to be merged and the size of the two suffix arrays to be merged, and the calculation formula is as follows:
wherein a is the current available memory coefficient, M is the current system available physical memory, Data1 and Data2 are the sizes of the source Data of the two suffix arrays to be merged respectively, and SA1 and SA2 are the sizes of the two suffix arrays to be merged respectively.
Further, after the adaptive merging model calculates the current available memory coefficient, the current available memory coefficient is discretized.
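The discretization step can be sketched as follows. The interval boundaries in `BIN_EDGES` and the function name `discretize_coefficient` are assumptions made for illustration only; the actual interval ranges are not specified here.

```python
from bisect import bisect_right

# Hypothetical interval boundaries for the current available memory
# coefficient; the real boundaries are not given in the description.
BIN_EDGES = (0.1, 0.25, 0.5, 1.0)


def discretize_coefficient(a: float, edges=BIN_EDGES) -> int:
    """Map a continuous coefficient to the index of its interval.

    Index 0 covers values below edges[0]; index len(edges) covers values
    at or above the last edge.
    """
    if a < 0:
        raise ValueError("the coefficient must be non-negative")
    return bisect_right(edges, a)


if __name__ == "__main__":
    for value in (0.05, 0.1, 0.3, 2.0):
        print(value, "->", discretize_coefficient(value))
```

Turning the coefficient into a small number of interval indices gives the decision tree a categorical feature to split on instead of a continuous value.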
A suffix array adaptive merging apparatus, comprising:
the memory size acquisition module is used for acquiring the size of the available memory of the current system;
the data reading module is used for acquiring the size and the type of source data of two suffix arrays to be combined;
the suffix distinguishing module is used for obtaining the size and the type of two suffix arrays to be combined;
and the merging module is used for calling parameters in the memory size acquisition module, the data reading module and the suffix distinguishing module and inputting the parameters into the trained self-adaptive merging model, and the self-adaptive merging model merges two suffix arrays to be merged by a merging method consuming the shortest merging time.
A storage device having stored therein a plurality of instructions adapted to be loaded by a processor to execute the following steps:
acquiring the size of available memory of a current system;
acquiring the size and the type of source data of two suffix arrays to be combined;
acquiring the size and the type of two suffix arrays to be combined;
and inputting the available memory size of the current system, the source data size and type of the two suffix arrays to be merged and the size and type of the two suffix arrays to be merged as parameters into a trained adaptive merging model, wherein the adaptive merging model merges the two suffix arrays to be merged by a merging method consuming the shortest merging time.
The invention has the following beneficial effects: an adaptive merging model for suffix arrays is constructed by training a decision tree; on a computer with a specified configuration, the decision tree automatically selects the merging method with the lowest time consumption according to the size and type of the source data of the suffix arrays to be merged, the currently available memory, and the size and type of the suffix arrays; the merging efficiency of an index system that uses the suffix array as its underlying index structure is thereby improved, and the adverse effect of a single, inefficient conventional merging method on the maintenance of a full-text index is avoided.
Detailed Description
Referring to fig. 1 and 2, a first embodiment is a suffix array adaptive merging method in which a decision tree is trained as the adaptive merging model through the following steps:
A1. collecting a plurality of data of various types (including gene data, binary data, text data, video and voice data and the like), and respectively splitting and integrating the data of various types into data of different magnitudes;
A2. creating suffix arrays of different types for each data of the step A1 and recording the types and sizes of the suffix arrays;
A3. setting the current available memory coefficient, wherein the calculation formula is as follows:
wherein a is the current available memory coefficient, M is the current system available physical memory, Data1 and Data2 are the sizes of the source Data of the two suffix arrays to be merged respectively, and SA1 and SA2 are the sizes of the two suffix arrays to be merged respectively;
A4. discretizing the current available memory coefficient of the step A3;
A5. merging all the suffix arrays of the step A2 pairwise, under current available memory coefficients in different interval ranges, by using three merging algorithms, and taking the merging method with the shortest time consumption as the selected method; the merging algorithms comprise a merge sort method, a method for merging the source data and re-creating the suffix array, and a direct merging algorithm;
A6. dividing the training results obtained in the step A5 into two groups, wherein one group is a training set, and the other group is a verification set;
A7. and after the decision tree is trained by using the training set, verifying and pruning the decision tree by using the verification set to finally obtain the self-adaptive merging model of the suffix array suitable for the current machine configuration.
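Steps A5 and A6 can be sketched as follows: each candidate merging method is timed on the same pair of suffix arrays, the fastest one becomes the label of the training sample, and the labelled samples are then divided into a training set and a verification set. All names (`time_merge`, `label_sample`, `split_samples`), the 30% verification fraction and the dictionary layout of a sample are assumptions made for illustration; the training and pruning of the decision tree itself (step A7) is not shown.

```python
import random
import time


def time_merge(merge_fn, sa1, sa2, clock=time.perf_counter):
    """Return the time taken by one merging method on one pair of arrays."""
    start = clock()
    merge_fn(sa1, sa2)
    return clock() - start


def label_sample(features, methods, sa1, sa2, clock=time.perf_counter):
    """Time every candidate method and label the sample with the fastest.

    `features` is a dict of input parameters (types, sizes, discretized
    coefficient); `methods` maps a method name to a callable.
    """
    timings = {name: time_merge(fn, sa1, sa2, clock) for name, fn in methods.items()}
    best = min(timings, key=timings.get)
    return {**features, "label": best}


def split_samples(samples, validation_fraction=0.3, seed=0):
    """Shuffle the labelled samples and divide them into two groups."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    cut = len(shuffled) - n_validation
    return shuffled[:cut], shuffled[cut:]
```

The `clock` parameter exists only so that the timing logic can be exercised with a deterministic clock; in normal use the default `time.perf_counter` is kept.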
It should be noted that the direct merging algorithm among the merging methods is not limited to any fixed algorithm; every algorithm that directly merges suffix arrays falls within the scope of the direct merging algorithm. Likewise, further merging algorithms may be added in step A5, which is not limited to three merging algorithms.
After the training of the decision tree is completed, the self-adaptive merging model inherits all the characteristics of the decision tree, and the following steps are executed for merging two suffix arrays to be merged:
B1. acquiring the size and the type of two suffix arrays to be combined;
B2. acquiring the size and the type of source data of two suffix arrays to be combined;
B3. acquiring an available physical memory of a current computer;
B4. inputting the data of the steps B1-B3 into the adaptive merging model as parameters, whereupon the adaptive merging model calculates the current available memory coefficient, discretizes it, and obtains from these data the merging method with the shortest merging time;
B5. merging the two suffix arrays to be merged according to the method obtained in the step B4.
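The merging workflow above can be sketched as a simple dispatch: the collected parameters are passed to the trained model, which returns the name of a merging method, and the corresponding routine is then called. The `MergeRequest` fields, the `adaptive_merge` function and the representation of the model as a plain callable are hypothetical choices for illustration; how the model computes and discretizes the coefficient internally is left to the callable.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class MergeRequest:
    """Parameters gathered before merging (field names are illustrative)."""
    sa1_size: int
    sa1_type: str
    sa2_size: int
    sa2_type: str
    data1_size: int
    data1_type: str
    data2_size: int
    data2_type: str
    available_memory: int


def adaptive_merge(
    request: MergeRequest,
    model: Callable[[MergeRequest], str],
    methods: Dict[str, Callable],
    sa1,
    sa2,
):
    """Ask the model for a method name, then run that merging method."""
    name = model(request)
    if name not in methods:
        raise ValueError(f"model selected unknown merging method: {name!r}")
    return name, methods[name](sa1, sa2)
```

Keeping the merging methods in a dictionary means that further merging algorithms can be registered without changing the dispatch logic.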
Referring to fig. 5, as a second embodiment, according to the merging method of the first embodiment, there is a suffix array adaptive merging apparatus including the following modules:
the memory size acquisition module is used for acquiring the size of the available memory of the current system;
the data reading module is used for acquiring the size and the type of source data of two suffix arrays to be combined;
the suffix distinguishing module is used for obtaining the size and the type of two suffix arrays to be combined;
and the merging module is used for calling parameters in the memory size acquisition module, the data reading module and the suffix distinguishing module and inputting the parameters into the trained self-adaptive merging model, and the self-adaptive merging model merges two suffix arrays to be merged by a merging method consuming the shortest merging time.
And the model training module is used for training the decision tree.
The model training module is used for calculating the time consumed by merging training data of the same or different data types, using suffix arrays of the same or different types, under different amounts of currently available memory, and taking the merging method with the lowest time consumption as the selected merging method; and for dividing the training results into two groups, wherein one group is used as a training set and the other group is used as a verification set, the training set is used for training the decision tree, the verification set is then used for verifying and pruning the decision tree, and the trained decision tree is input into the merging module to be used as the adaptive merging model.
The merging methods used by the model training module for merging the training data comprise a merge sort method, a method for reconstructing a suffix array after merging the source data, and a merging algorithm for directly merging the suffix arrays.
The self-adaptive merging model selection merging method takes a current available memory coefficient as one of judgment parameters, the current available memory coefficient is related to the available memory size of a current system, the source data sizes of two suffix arrays to be merged and the sizes of the two suffix arrays to be merged, and the calculation formula is as follows:
wherein a is the current available memory coefficient, M is the current system available physical memory, Data1 and Data2 are the sizes of the source Data of the two suffix arrays to be merged respectively, and SA1 and SA2 are the sizes of the two suffix arrays to be merged respectively.
And the merging module is used for discretizing the current available memory coefficient after calculating the current available memory coefficient.
Also, there is a storage device having stored therein a plurality of instructions adapted to be loaded and executed by a processor to:
acquiring the size of available memory of a current system;
acquiring the size and the type of source data of two suffix arrays to be combined;
acquiring the size and the type of two suffix arrays to be combined;
and inputting the parameters into a trained self-adaptive merging model, and merging the two suffix arrays to be merged by the self-adaptive merging model by a merging method consuming the shortest merging time.
Further comprising instructions for training the decision tree: calculating the time consumed by merging training data of the same or different data types, using suffix arrays of the same or different types, under different amounts of currently available memory, and taking the merging method with the lowest time consumption as the selected merging method; and dividing the training results into two groups, wherein one group is used as a training set and the other group is used as a verification set, the training set is used for training the decision tree, the verification set is then used for verifying and pruning the decision tree, and the trained decision tree is input into the merging module to be used as the adaptive merging model.
In the instructions for training the decision tree, the merging methods for merging the training data include a merge sort method, a method for reconstructing a suffix array after merging the source data, and a merging algorithm for directly merging the suffix arrays.
The self-adaptive merging model selection merging method takes a current available memory coefficient as one of judgment parameters, the current available memory coefficient is related to the available memory size of a current system, the source data sizes of two suffix arrays to be merged and the sizes of the two suffix arrays to be merged, and the calculation formula is as follows:
wherein a is the current available memory coefficient, M is the current system available physical memory, Data1 and Data2 are the sizes of the source Data of the two suffix arrays to be merged respectively, and SA1 and SA2 are the sizes of the two suffix arrays to be merged respectively.
After the current available memory coefficient is calculated, the current available memory coefficient is discretized.
Referring to fig. 3 and 4, a third embodiment illustrates a specific workflow: following the method of the first embodiment, after the decision tree has been trained and the adaptive merging model, which covers the merging methods A, B and C, has been generated, the system receives two suffix arrays to be merged.
The two suffix arrays to be merged are SA1 and SA2, respectively, and the source data of the two suffix arrays to be merged are s1 and s2, respectively, where the size of s1 is greater than the size of s2.
The source data s1 is of the text data type with a size of 3GB, and each entry of its suffix array uses five bytes to represent the address of one single-byte data unit in s1, so its suffix array SA1 has a size of 15GB; the source data s2 is of the binary data type with a size of 1GB, and each entry of its suffix array uses four bytes to represent the address of one single-byte data unit in s2, so its suffix array SA2 has a size of 4GB; neither suffix array takes a compressed form.
The size of the current available physical memory M is 2.5 GB;
from the above data, the current available memory coefficient can be calculated as
The types of the source data s1 and s2 and the current available memory coefficient are input into the adaptive merging model, which determines that the suffix array merging method suited to the two source data is method B.
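The suffix array sizes quoted in this embodiment follow from the number of single-byte data units and the number of bytes used per address. The short sketch below reproduces that arithmetic; the function name `suffix_array_size_gb` is hypothetical, and the sketch assumes an uncompressed suffix array with one entry per byte of source data, as stated for this example.

```python
def suffix_array_size_gb(source_size_gb: int, bytes_per_entry: int) -> int:
    """Size of an uncompressed suffix array with one entry per source byte.

    Each single-byte data unit of the source has one address entry, so the
    array is `bytes_per_entry` times as large as the source data.
    """
    return source_size_gb * bytes_per_entry


if __name__ == "__main__":
    sa1 = suffix_array_size_gb(3, 5)  # text data s1, five-byte addresses
    sa2 = suffix_array_size_gb(1, 4)  # binary data s2, four-byte addresses
    total = 3 + 1 + sa1 + sa2         # both sources plus both suffix arrays
    print(sa1, sa2, total)
```

The source data and suffix arrays together occupy 23GB, far more than the 2.5GB of available physical memory, which is the situation the current available memory coefficient is meant to capture.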
The three embodiments train the decision tree with three or more merging methods; the resulting decision tree determines the merging method according to the discretized current available memory coefficient and the source data types, and the two suffix arrays to be merged are merged accordingly.
The above description is only a preferred embodiment of the present invention, and the present invention is not limited to the above embodiment; any implementation that achieves the technical effects of the present invention by the same means shall fall within the protection scope of the present invention.