KR20170074418A

KR20170074418A - Apparatus and method for converting k-mer for measuring similarity of sequences

Info

Publication number: KR20170074418A
Application number: KR1020150183675A
Authority: KR
Inventors: 오정수; 김경수; 최치환
Original assignee: 주식회사 코아아이티
Priority date: 2015-12-22
Filing date: 2015-12-22
Publication date: 2017-06-30

Abstract

The present invention relates to a k-mer conversion apparatus and method for measuring the similarity of sequences, which make it possible to quickly measure the similarity of large-capacity sequences. A k-mer conversion apparatus for measuring the similarity of sequences according to the present invention comprises: a sequence input unit configured to input two sequences X and Y; a sequence total length calculation unit configured to calculate the total length of each of the sequences X and Y input from the sequence input unit; a k-mer generation unit configured to generate k-mers by dividing each of the sequences X and Y input from the sequence input unit according to a set k; a k-mer / number substitution unit configured to replace the k-mers of each of the sequences X and Y generated by the k-mer generation unit with numbers using a hash map; a k-mer frequency counting unit configured to count the frequency of the k-mers replaced with numbers by the k-mer / number substitution unit; and a sequence similarity calculation unit configured to calculate the similarity of the sequences (the k-mer distance) using the total length of each of the sequences X and Y calculated by the sequence total length calculation unit and the frequency of the k-mers counted by the k-mer frequency counting unit.

Description

APPARATUS AND METHOD FOR CONVERTING K-MER FOR MEASURING SIMILARITY OF SEQUENCES

The present invention relates to a k-mer conversion apparatus and method for measuring the degree of similarity of a sequence, and more particularly, to a k-mer conversion apparatus and method for measuring similarity of a sequence, which enables rapid measurement of the similarity of a large-capacity sequence.

In general, determining the similarity between sequences is essential for finding a homologous protein or a common ancestral gene between two given sequences. Among the methods of comparing sequence similarity, the most accurate method currently known is global alignment using the NW (Needleman-Wunsch) algorithm. However, because this method is slow, it is difficult to apply to large amounts of sequences.

To solve this problem, several sequence similarity measurement algorithms have been introduced. Among them, the technique that measures sequence similarity by calculating the k-mer distance is less accurate than the NW algorithm, but it is widely used for similarity comparison.

A conventional sequence similarity measuring method, as disclosed for example in Korean Patent Registration No. 1479735, comprises: a first step in which two sequence data out of a plurality of sequence aggregation data, together with reference value data set from the outside, are input from a database and sorted by a processor according to the sequence pair of the sequence data; a second step in which, after the first step, a processing module of the processor, to which a k-mer algorithm is applied that calculates the k-mer distance using the k-mer profiles of the two sequence data, the lengths of the sequences, and a value according to the k-mer, extracts a first result value by calculating the k-mer distance of the sequence pair of the two sequence data; a third step in which a determination module of the processor extracts, from among the first result values of the second step, a second result value which is an ideal value of the reference value data; and a fourth step in which a processing unit marshals the result value of the third step and performs parallelization processing.

However, the conventional method of measuring sequence similarity generates the k-mers of each sequence in advance and stores them in memory, and thus has the problem that it occupies a large memory capacity and its processing speed is slow.

In addition, the recent advent of NGS (Next Generation Sequencing) techniques has produced far larger amounts of nucleotide sequences than before. As the necessity of measuring the similarity among such large quantities of sequences grows, the need for rapid sequence similarity measurement is increasing.

SUMMARY OF THE INVENTION The present invention has been made in view of the above problems, and it is an object of the present invention to provide a k-mer conversion apparatus and method for measuring the similarity of sequences that do not occupy a large memory capacity and therefore achieve a fast processing speed.

According to an aspect of the present invention, there is provided a k-mer conversion apparatus for measuring the similarity of sequences, comprising: a sequence input unit configured to input two sequences X and Y; a sequence total length calculation unit configured to calculate the total length of each of the sequences X and Y input from the sequence input unit; a k-mer generation unit configured to generate k-mers by dividing each of the sequences X and Y input from the sequence input unit according to a set k; a k-mer / number substitution unit configured to replace the k-mers of each of the sequences X and Y generated by the k-mer generation unit with numbers using a hash map; a k-mer frequency counting unit configured to count the frequency of the k-mers replaced with numbers by the k-mer / number substitution unit; and a sequence similarity calculation unit configured to calculate the similarity of the sequences (the k-mer distance) using the total length of each of the sequences X and Y calculated by the sequence total length calculation unit and the frequency of the k-mers counted by the k-mer frequency counting unit.

In the k-mer conversion apparatus for measuring the similarity of sequences according to the above embodiment, the sequence similarity calculation unit may calculate the sequence similarity using the following Equation 1:

(1)

[where X and Y denote the sequences, d_{X,Y} denotes the k-mer distance between sequences X and Y, τ denotes a k-mer, n_X(τ) and n_Y(τ) denote the frequencies of the corresponding k-mer in sequences X and Y, respectively, l_X and l_Y denote the total lengths of sequences X and Y, respectively, and k denotes a constant indicating the length of the k-mer]
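The image of Equation 1 is not reproduced in this text, so its exact form cannot be confirmed here. As an assumption only, a commonly used k-mer measure built from exactly the symbols defined above is the number of k-mers shared by X and Y divided by the number of k-mers in the shorter sequence:

```latex
d_{X,Y} = \frac{\sum_{\tau} \min\bigl[\, n_X(\tau),\; n_Y(\tau) \,\bigr]}{\min(l_X,\; l_Y) - k + 1}
```

Under this assumed form the value ranges from 0 (no k-mer in common) to 1 (every k-mer of the shorter sequence also occurs in the longer one).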

According to another aspect of the present invention, there is provided a k-mer conversion method for measuring the similarity of sequences, comprising: inputting two sequences X and Y from a sequence input unit; calculating, by a sequence total length calculation unit, the total length of each of the sequences X and Y input in the inputting step; generating, by a k-mer generation unit, k-mers by dividing each of the sequences X and Y input in the inputting step according to a set k; replacing, by a k-mer / number substitution unit, the k-mers of each of the sequences X and Y generated in the generating step with numbers using a hash map; counting, by a k-mer frequency counting unit, the frequency of the k-mers replaced with numbers in the replacing step; and calculating, by a sequence similarity calculation unit, the similarity of the sequences (the k-mer distance) using the total length of each of the sequences X and Y calculated in the calculating step and the frequency of the k-mers counted in the counting step.

According to the k-mer conversion apparatus and method for measuring the similarity of sequences according to the embodiment of the present invention, the sequence total length calculation unit calculates the total length of each of the sequences X and Y input from the sequence input unit; the k-mer generation unit divides each of the sequences X and Y input from the sequence input unit according to the set k to generate k-mers; the k-mer / number substitution unit replaces the k-mers of each of the sequences X and Y generated by the k-mer generation unit with numbers using a hash map; the k-mer frequency counting unit counts the frequency of the k-mers replaced with numbers by the k-mer / number substitution unit; and the sequence similarity calculation unit calculates the similarity of the sequences (the k-mer distance) using the total length of each of the sequences X and Y calculated by the sequence total length calculation unit and the frequency of the k-mers counted by the k-mer frequency counting unit. That is, because the k-mers of each of the sequences X and Y are replaced with numbers, the frequencies of those numbers are counted, and the similarity of the sequences is calculated from them, a large memory capacity is not occupied and the processing speed is fast, which makes the invention well suited to measuring the similarity of large-capacity sequences.

FIG. 1 is a control block diagram of a k-mer conversion apparatus for measuring the similarity of sequences according to an embodiment of the present invention.
FIG. 2 is a flowchart showing a k-mer conversion method for measuring the degree of similarity of a sequence, which is implemented by a k-mer conversion apparatus for measuring the similarity of sequences according to an embodiment of the present invention.
FIG. 3 is a diagram for explaining a step of generating a k-mer according to the set k in FIG. 2 and replacing the generated k-mer with a number using a hash map.
FIG. 4 is a comparison of the memory usage and running time of an embodiment of the present invention with those of the prior art.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a control block diagram of a k-mer conversion apparatus for measuring the similarity of sequences according to an embodiment of the present invention.

Referring to FIG. 1, a k-mer conversion apparatus for measuring the similarity of sequences according to an embodiment of the present invention includes a sequence input unit 100, a sequence total length calculation unit 200, a k-mer generation unit 300, a k-mer / number substitution unit 400, a k-mer frequency counting unit 500, and a sequence similarity calculation unit 600.

The sequence input unit 100 stores a plurality of sequences and inputs the two sequences X and Y whose similarity is to be measured to the sequence total length calculation unit 200.

The sequence total length calculation unit 200 calculates the total lengths (l_X, l_Y) of the sequences X and Y input from the sequence input unit 100.

The k-mer generation unit 300 divides each of the sequences X and Y inputted from the sequence input unit 100 according to the set k to generate a plurality of k-mers. k means a constant value indicating the length of the k-mer, can be arbitrarily set by the user, and can be appropriately adjusted to increase the sensitivity and accuracy.
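The division into k-mers can be sketched as a sliding window that advances one position at a time, which is consistent with the FIG. 3 example in this specification. This is a minimal illustration, not the applicant's implementation; the function name `generate_kmers` is hypothetical.

```python
def generate_kmers(sequence: str, k: int) -> list[str]:
    """Return every length-k window of `sequence`, moving one position at a time."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # A sequence of length l yields l - k + 1 windows (none if l < k).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Sequence Y of the FIG. 3 example with k = 7.
print(generate_kmers("TAATACTAA", 7))  # ['TAATACT', 'AATACTA', 'ATACTAA']
```

A sequence of total length l therefore produces l - k + 1 k-mers, which is why the total lengths are calculated alongside the k-mers.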

The k-mer / number substitution unit 400 replaces the k-mers of each of the sequences X and Y generated by the k-mer generation unit 300 with numbers (e.g., 1, 2, 3, 4, 5, and so on) using a hash map.
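A minimal sketch of the substitution, assuming that one hash map is shared by both sequences and that each previously unseen k-mer receives the next unused number starting at 1. Both assumptions are chosen to match the numbers given for the FIG. 3 example; `substitute_kmers` and `table` are hypothetical names.

```python
def substitute_kmers(sequence: str, k: int, table: dict[str, int]) -> list[int]:
    """Replace each k-mer of `sequence` with its number, extending `table` as needed."""
    numbers = []
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer not in table:
            # Assumption: a new k-mer gets the next unused number, starting at 1.
            table[kmer] = len(table) + 1
        numbers.append(table[kmer])
    return numbers


table: dict[str, int] = {}  # one hash map shared by both sequences
x_numbers = substitute_kmers("AATAATACTAA", 7, table)  # sequence X of FIG. 3
y_numbers = substitute_kmers("TAATACTAA", 7, table)    # sequence Y of FIG. 3
print(x_numbers)  # [1, 2, 3, 4, 5]
print(y_numbers)  # [3, 4, 5]
```

Sharing the hash map is what makes the numbers comparable across sequences: the same k-mer maps to the same number in X and in Y.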

The k-mer frequency counting unit 500 counts the frequency of the k-mers that have been replaced with numbers by the k-mer / number substitution unit 400. That is, instead of counting how many times each k-mer itself occurs in each of the sequences X and Y, it counts how many times the number substituted for that k-mer occurs in each of the sequences X and Y. As a result, memory usage is greatly reduced and the processing speed is increased, which makes the approach suitable for measuring the similarity of large-capacity sequences.
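If the substituted numbers are small consecutive integers, the frequencies can be held in a plain list indexed by number rather than in a structure keyed by k-mer strings. The following is a sketch under that assumption; the specification does not prescribe this layout, and `count_numbers` and `table_size` are hypothetical names.

```python
def count_numbers(numbers: list[int], table_size: int) -> list[int]:
    """Count how often each substituted number occurs in one sequence.

    Index 0 is unused because the numbers in this sketch start at 1.
    """
    counts = [0] * (table_size + 1)
    for number in numbers:
        counts[number] += 1
    return counts


# Hypothetical input: a sequence whose k-mers were replaced by 1, 2, 1, 3, 1.
print(count_numbers([1, 2, 1, 3, 1], table_size=3))  # [0, 3, 1, 1]
```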

The sequence similarity calculation unit 600 calculates the similarity of the sequences (the k-mer distance d_{X,Y}) using the total lengths l_X and l_Y of the sequences X and Y calculated by the sequence total length calculation unit 200 and the frequencies [n_X(τ), n_Y(τ)] of the k-mers counted by the k-mer frequency counting unit 500. The sequence similarity calculation unit 600 may calculate the sequence similarity using Equation 1 above.

Although the sequence input unit 100, the sequence total length calculation unit 200, the k-mer generation unit 300, the k-mer / number substitution unit 400, the k-mer frequency counting unit 500, and the sequence similarity calculation unit 600 have been described as individual components, they may be implemented together in a PC (personal computer), a notebook, a smartphone, a netbook, or the like.

Hereinafter, a k-mer conversion method for measuring the similarity of sequences, which is implemented by the k-mer conversion apparatus for measuring the similarity of sequences according to an embodiment of the present invention configured as described above, will be described.

FIG. 2 is a flowchart showing a k-mer conversion method for measuring the similarity of sequences, which is implemented by a k-mer conversion apparatus for measuring the similarity of sequences according to an embodiment of the present invention.

First, the sequence input unit 100 inputs two sequences X and Y to the sequence total length calculation unit 200 (S1).

Then, the total length (l_X, l_Y) of each of the sequences X and Y input in step S1 is calculated by the sequence total length calculation unit 200 (S2).

In step S3, the k-mer generation unit 300 divides each of the sequences X and Y inputted in step S1 according to the set k to generate k-mers.

In step S4, the k-mer / number replacement unit 400 replaces the k-mers for each of the sequences X and Y generated in step S3 with numbers using a hash map.

FIG. 3 is a diagram for explaining a step of generating a k-mer according to the set k in FIG. 2 and replacing the generated k-mer with a number using a hash map.

Referring to FIG. 3, k is set to 7, "AATAATACTAA" represents sequence X, and "AATAATA", "ATAATAC", "TAATACT", "AATACTA", and "ATACTAA" represent the k-mers generated by dividing sequence X. "AATAATA", "ATAATAC", "TAATACT", "AATACTA", and "ATACTAA" are replaced by "1", "2", "3", "4", and "5", respectively, by a hash map.

"TAATACTAA" represents sequence Y, and "TAATACT", "AATACTA", and "ATACTAA" represent k-mers produced by dividing sequence Y. "TAATACT", "AATACTA", and "ATACTAA" are replaced by "3", "4", and "5" respectively by a hash map.

In step S5, the k-mer frequency counting unit 500 counts the frequency of k-mers whose numbers have been replaced in step S4.

In step S6, the sequence similarity calculation unit 600 calculates the similarity of the sequences (the k-mer distance) by Equation 1 above, using the total lengths (l_X, l_Y) of the sequences X and Y calculated in step S2 and the frequencies [n_X(τ), n_Y(τ)] of the k-mers counted in step S5.
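Steps S1 to S6 can be sketched end to end as follows. Because the image of Equation 1 is not reproduced in this text, the last line uses an assumed form (shared k-mer count divided by the number of k-mers in the shorter sequence); it should not be read as the applicant's exact formula. The function name `kmer_similarity` and the numbering scheme of the hash map are also assumptions.

```python
from collections import Counter


def kmer_similarity(x: str, y: str, k: int) -> float:
    """Steps S1-S6 under an ASSUMED form of Equation 1."""
    l_x, l_y = len(x), len(y)                       # S2: total lengths
    if min(l_x, l_y) < k:
        raise ValueError("both sequences must be at least k long")
    table: dict[str, int] = {}                      # shared hash map: k-mer -> number

    def to_numbers(seq: str) -> list[int]:
        # S3 + S4: slide a window of length k and replace each k-mer with a number
        return [table.setdefault(seq[i:i + k], len(table) + 1)
                for i in range(len(seq) - k + 1)]

    n_x = Counter(to_numbers(x))                    # S5: frequencies of numbers in X
    n_y = Counter(to_numbers(y))                    # S5: frequencies of numbers in Y
    shared = sum(min(count, n_y[number]) for number, count in n_x.items())
    return shared / (min(l_x, l_y) - k + 1)         # S6: assumed formula


# Sequences X and Y of the FIG. 3 example with k = 7.
print(kmer_similarity("AATAATACTAA", "TAATACTAA", 7))  # 1.0
```

With the FIG. 3 sequences, the three k-mers of Y (numbers 3, 4, and 5) all occur in X, so the assumed formula gives 3 / (9 - 7 + 1) = 1.0.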

FIG. 4 is a comparison of the memory usage and running time of an embodiment of the present invention with those of the prior art. As shown in FIG. 4, compared with the prior art (which does not use the k-mer substitution method), the k-mer conversion method for measuring the similarity of sequences according to the embodiment of the present invention is more than twice as fast and improves memory efficiency by 30% to 60%.

According to the k-mer conversion apparatus and method for measuring the similarity of sequences according to an embodiment of the present invention, the sequence total length calculation unit calculates the total length of each of the sequences X and Y input from the sequence input unit; the k-mer generation unit divides each of the sequences X and Y input from the sequence input unit according to the set k to generate k-mers; the k-mer / number substitution unit replaces the k-mers of each of the sequences X and Y generated by the k-mer generation unit with numbers using a hash map; the k-mer frequency counting unit counts the frequency of the k-mers replaced with numbers by the k-mer / number substitution unit; and the sequence similarity calculation unit calculates the similarity of the sequences (the k-mer distance) using the total length of each of the sequences X and Y calculated by the sequence total length calculation unit and the frequency of the k-mers counted by the k-mer frequency counting unit. That is, because the k-mers of each of the sequences X and Y are replaced with numbers, the frequencies of those numbers are counted, and the similarity of the sequences is calculated from them, a large memory capacity is not occupied and the processing speed is fast, which makes the invention suitable for measuring the similarity of large-capacity sequences.

Although the best mode has been shown and described in the drawings and specification, the specific terminology used herein serves only to describe the embodiments of the invention and is not intended to limit the meaning or the scope of the invention described in the claims. Therefore, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the scope of the present invention. Accordingly, the true scope of the present invention should be determined by the technical idea of the appended claims.

100: sequence input unit
200: sequence total length calculation unit
300: k-mer generation unit
400: k-mer / number substitution unit
500: k-mer frequency counting unit
600: sequence similarity calculation unit

Claims

A k-mer conversion apparatus for measuring the similarity of sequences, comprising:
a sequence input unit configured to input two sequences X and Y;
a sequence total length calculation unit configured to calculate the total length of each of the sequences X and Y input from the sequence input unit;
a k-mer generation unit configured to generate k-mers by dividing each of the sequences X and Y input from the sequence input unit according to a set k;
a k-mer / number substitution unit configured to replace the k-mers of each of the sequences X and Y generated by the k-mer generation unit with numbers using a hash map;
a k-mer frequency counting unit configured to count the frequency of the k-mers replaced with numbers by the k-mer / number substitution unit; and
a sequence similarity calculation unit configured to calculate the similarity of the sequences (the k-mer distance) using the total length of each of the sequences X and Y calculated by the sequence total length calculation unit and the frequency of the k-mers counted by the k-mer frequency counting unit.

The k-mer conversion apparatus according to claim 1,
wherein the sequence similarity calculation unit calculates the sequence similarity using the following Equation 1:

(1)

[where X and Y denote the sequences, d_{X,Y} denotes the k-mer distance between sequences X and Y, τ denotes a k-mer, n_X(τ) and n_Y(τ) denote the frequencies of the corresponding k-mer in sequences X and Y, respectively, l_X and l_Y denote the total lengths of sequences X and Y, respectively, and k denotes a constant indicating the length of the k-mer]

A k-mer conversion method for measuring the similarity of sequences, implemented by the k-mer conversion apparatus for measuring the similarity of sequences as set forth in claim 1 or 2, the method comprising:
inputting two sequences X and Y from the sequence input unit;
calculating, by the sequence total length calculation unit, the total length of each of the sequences X and Y input in the inputting step;
generating, by the k-mer generation unit, k-mers by dividing each of the sequences X and Y input in the inputting step according to the set k;
replacing, by the k-mer / number substitution unit, the k-mers of each of the sequences X and Y generated in the generating step with numbers using a hash map;
counting, by the k-mer frequency counting unit, the frequency of the k-mers replaced with numbers in the replacing step; and
calculating, by the sequence similarity calculation unit, the similarity of the sequences (the k-mer distance) using the total length of each of the sequences X and Y calculated in the calculating step and the frequency of the k-mers counted in the counting step.