KR20170074418A - Apparatus and method for converting k-mer for measuring similarity of sequences - Google Patents

Apparatus and method for converting k-mer for measuring similarity of sequences

Info

Publication number
KR20170074418A
Authority
KR
South Korea
Prior art keywords
mer
sequence
sequences
unit
similarity
Prior art date
Application number
KR1020150183675A
Other languages
Korean (ko)
Inventor
오정수
김경수
최치환
Original Assignee
주식회사 코아아이티
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 주식회사 코아아이티
Priority to KR1020150183675A
Publication of KR20170074418A

Classifications

    • G06F19/22
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F19/24

Abstract

The present invention relates to a k-mer conversion apparatus and method for measuring the similarity of sequences, which make it possible to measure the similarity of large-volume sequence data quickly. The k-mer conversion apparatus according to the present invention comprises: a sequence input unit configured to input two sequences X and Y; a sequence total length calculation unit configured to calculate the total length of each of the sequences X and Y input from the sequence input unit; a k-mer generation unit configured to generate k-mers by dividing each of the sequences X and Y according to a set value k; a k-mer/number substitution unit configured to replace the k-mers generated for each of the sequences X and Y with numbers using a hash map; a k-mer frequency counting unit configured to count the frequency of the k-mers that were replaced with numbers; and a sequence similarity calculation unit configured to calculate the sequence similarity (k-mer distance) using the total lengths calculated by the sequence total length calculation unit and the frequencies counted by the k-mer frequency counting unit.

Description

APPARATUS AND METHOD FOR CONVERTING K-MER FOR MEASURING SIMILARITY OF SEQUENCES

The present invention relates to a k-mer conversion apparatus and method for measuring the similarity of sequences, and more particularly to a k-mer conversion apparatus and method that enable rapid measurement of the similarity of large-volume sequence data.

In general, determining the similarity between sequences is essential for finding homologous proteins or a common ancestral gene between two given sequences. Among the methods for comparing sequence similarity, the most accurate one currently known is global alignment using the Needleman-Wunsch (NW) algorithm. However, because this method is slow, it is difficult to apply to large numbers of sequences.

To solve this problem, several sequence similarity measurement algorithms have been introduced. Among them, the technique based on k-mer distance calculation is less accurate than the NW algorithm but is widely used for similarity comparison.

A conventional sequence similarity measuring method is disclosed, for example, in Korean Patent Registration No. 1479735. It comprises: a first step in which two sequence data items out of a plurality of sequence set data, together with externally set reference value data, are input from a database and sorted by a processor into sequence pairs; a second step in which a processing module of the processor, to which a k-mer algorithm is applied, extracts a first result value by calculating the k-mer distance of the sequence pair from the k-mer profiles of the two sequence data items, the sequence lengths, and the value of k; a third step in which a determination module of the processor extracts, from among the first result values, a second result value that satisfies the reference value data; and a fourth step in which the processor marshals the result value of the third step and performs parallel processing.

However, the conventional method stores the k-mers themselves in memory for each sequence, and therefore requires a large memory capacity and processes slowly.

In addition, recent NGS (Next Generation Sequencing) techniques produce far more nucleotide sequences than earlier methods. As the need to measure similarity among such large quantities of sequences grows, so does the need for rapid sequence similarity measurement.

SUMMARY OF THE INVENTION The present invention has been made in view of the above problems, and an object of the present invention is to provide a k-mer conversion apparatus and method for measuring the similarity of sequences that increase processing speed by not occupying a large memory capacity.

According to an aspect of the present invention, there is provided a k-mer conversion apparatus for measuring the similarity of sequences, comprising: a sequence input unit configured to input two sequences X and Y; a sequence total length calculation unit configured to calculate the total length of each of the sequences X and Y input from the sequence input unit; a k-mer generation unit configured to generate k-mers by dividing each of the sequences X and Y input from the sequence input unit according to a set value k; a k-mer/number substitution unit configured to replace the k-mers generated by the k-mer generation unit for each of the sequences X and Y with numbers using a hash map; a k-mer frequency counting unit configured to count the frequency of the k-mers that were replaced with numbers in the k-mer/number substitution unit; and a sequence similarity calculation unit configured to calculate the sequence similarity (k-mer distance) using the total length of each of the sequences X and Y calculated by the sequence total length calculation unit and the k-mer frequencies counted by the k-mer frequency counting unit.

In the k-mer conversion apparatus for measuring the similarity of sequences according to the above embodiment, the sequence similarity calculation unit may calculate the sequence similarity using Equation 1 below.

[Equation 1: Figure pat00001]

Here, X and Y denote the sequences, d_{X,Y} denotes the k-mer distance between sequences X and Y, τ denotes a k-mer, n_X(τ) and n_Y(τ) denote the frequency of that k-mer in sequences X and Y, respectively, l_X and l_Y denote the total lengths of sequences X and Y, respectively, and k is a constant denoting the length of the k-mer.
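
The image of Equation 1 is not available in this text, so its exact form cannot be confirmed here. As an assumption only, one widely used k-mer measure that is built from exactly the symbols defined above is the fraction of common k-mers (used, for example, for the k-mer distance in the MUSCLE aligner). The name F_{X,Y} is introduced here for illustration; whether the patent's d_{X,Y} equals this quantity or a value derived from it (such as 1 − F_{X,Y}) is not stated in the text.

```latex
% Assumed form (not confirmed by the source): fraction of common k-mers
F_{X,Y} \;=\; \frac{\sum_{\tau} \min\bigl(n_X(\tau),\, n_Y(\tau)\bigr)}
                   {\min(l_X,\, l_Y) - k + 1}

% Worked arithmetic for the FIG. 3 example (X = AATAATACTAA, Y = TAATACTAA, k = 7):
% the three k-mers of Y each occur once in both sequences, so the numerator is 3;
% the denominator is min(11, 9) - 7 + 1 = 3.
F_{X,Y} \;=\; \frac{1 + 1 + 1}{\min(11,\, 9) - 7 + 1} \;=\; \frac{3}{3} \;=\; 1
```

Under this assumed form, the value 1 reflects that every k-mer of the shorter sequence Y also occurs in X.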

According to another aspect of the present invention, there is provided a k-mer conversion method for measuring the similarity of sequences, comprising: inputting two sequences X and Y from a sequence input unit; calculating, by a sequence total length calculation unit, the total length of each of the sequences X and Y input in the inputting step; generating, by a k-mer generation unit, k-mers by dividing each of the sequences X and Y input in the inputting step according to a set value k; replacing, by a k-mer/number substitution unit, the k-mers generated for each of the sequences X and Y with numbers using a hash map; counting, by a k-mer frequency counting unit, the frequency of the k-mers that were replaced with numbers in the replacing step; and calculating, by a sequence similarity calculation unit, the sequence similarity (k-mer distance) using the total length of each of the sequences X and Y calculated in the calculating step and the k-mer frequencies counted in the counting step.

According to the k-mer conversion apparatus and method of the embodiment of the present invention, the sequence total length calculation unit calculates the total length of each of the sequences X and Y input from the sequence input unit; the k-mer generation unit divides each of the sequences X and Y according to the set k to generate k-mers; the k-mer/number substitution unit replaces the k-mers generated for each of the sequences X and Y with numbers using a hash map; the k-mer frequency counting unit counts the frequency of the k-mers that were replaced with numbers; and the sequence similarity calculation unit calculates the sequence similarity (k-mer distance) using the calculated total lengths and the counted k-mer frequencies. In other words, the k-mers of each of the sequences X and Y are represented by numbers, the occurrences of those numbers are counted, and the similarity is calculated from the counts. Because this does not occupy a large amount of memory, processing is fast, which makes the approach well suited to measuring the similarity of large-volume sequence data.

FIG. 1 is a control block diagram of a k-mer conversion apparatus for measuring the similarity of sequences according to an embodiment of the present invention.
FIG. 2 is a flowchart showing a k-mer conversion method for measuring the similarity of sequences, implemented by the k-mer conversion apparatus according to an embodiment of the present invention.
FIG. 3 is a diagram explaining the steps in FIG. 2 of generating k-mers according to the set k and replacing the generated k-mers with numbers using a hash map.
FIG. 4 compares the memory usage and running time of an embodiment of the present invention with those of the prior art.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a control block diagram of a k-mer conversion apparatus for measuring the similarity of sequences according to an embodiment of the present invention.

Referring to FIG. 1, the k-mer conversion apparatus for measuring the similarity of sequences according to an embodiment of the present invention includes a sequence input unit 100, a sequence total length calculation unit 200, a k-mer generation unit 300, a k-mer/number substitution unit 400, a k-mer frequency counting unit 500, and a sequence similarity calculation unit 600.

The sequence input unit 100 stores a plurality of sequences and supplies the two sequences X and Y whose similarity is to be measured to the sequence total length calculation unit 200.

The sequence total length calculation unit 200 calculates the total lengths (l_X, l_Y) of the sequences X and Y input from the sequence input unit 100.

The k-mer generation unit 300 divides each of the sequences X and Y input from the sequence input unit 100 according to the set k to generate a plurality of k-mers. Here, k is a constant denoting the length of a k-mer; it can be set arbitrarily by the user and adjusted as appropriate to increase sensitivity and accuracy.
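
As a minimal sketch of this division step, assuming the overlapping window of step 1 that the FIG. 3 example shows, the k-mers can be generated as follows. The function name `generate_kmers` is hypothetical and not taken from the patent.

```python
def generate_kmers(sequence: str, k: int) -> list[str]:
    """Return every overlapping length-k substring of `sequence`, left to right.

    Assumption: the window advances one character at a time, as in FIG. 3.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # A sequence of length L yields L - k + 1 k-mers (none when L < k).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Sequences and k taken from the FIG. 3 example.
kmers_x = generate_kmers("AATAATACTAA", 7)
kmers_y = generate_kmers("TAATACTAA", 7)
```

With k = 7, the 11-character sequence X yields five k-mers and the 9-character sequence Y yields three.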

The k-mer/number substitution unit 400 uses a hash map to replace the k-mers generated by the k-mer generation unit 300 for each of the sequences X and Y with numbers (e.g., 1, 2, 3, 4, 5, and so on).

The k-mer frequency counting unit 500 counts the frequency of the k-mers that were replaced with numbers in the k-mer/number substitution unit 400. That is, instead of counting how many times each k-mer string itself occurs in each of the sequences X and Y, it counts how many times the number substituted for that k-mer occurs. As a result, memory usage is greatly reduced and processing speed is increased, which makes the approach suitable for measuring the similarity of large-volume sequence data.
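
The counting step can be sketched as below, assuming the substituted numbers are held as lists of integers. The function name `count_id_frequencies` is hypothetical, and the ID lists are those of the FIG. 3 example. The sketch only illustrates the bookkeeping; it makes no claim about memory savings in Python itself.

```python
from collections import Counter


def count_id_frequencies(kmer_ids: list[int]) -> Counter:
    """Count how often each substituted number occurs in one sequence."""
    return Counter(kmer_ids)


# Numbers substituted for the k-mers of X and Y in the FIG. 3 example.
freq_x = count_id_frequencies([1, 2, 3, 4, 5])
freq_y = count_id_frequencies([3, 4, 5])

# A Counter returns 0 for numbers that never occurred in a sequence.
missing_in_y = freq_y[1]
```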

The sequence similarity calculation unit 600 calculates the sequence similarity (k-mer distance; d_{X,Y}) using the total lengths (l_X, l_Y) of the sequences X and Y calculated by the sequence total length calculation unit 200 and the k-mer frequencies [n_X(τ), n_Y(τ)] counted by the k-mer frequency counting unit 500. The sequence similarity calculation unit 600 calculates the sequence similarity using Equation 1 below.

[Equation 1: Figure pat00002]

Here, X and Y denote the sequences, d_{X,Y} denotes the k-mer distance between sequences X and Y, τ denotes a k-mer, n_X(τ) and n_Y(τ) denote the frequency of that k-mer in sequences X and Y, respectively, l_X and l_Y denote the total lengths of sequences X and Y, respectively, and k is a constant denoting the length of the k-mer.
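
Because the equation image is not reproduced in this text, the following is only a sketch under a stated assumption: it computes the fraction of common k-mers, sum of min(n_X(τ), n_Y(τ)) divided by min(l_X, l_Y) − k + 1, which is a common k-mer measure built from the same inputs. It is not confirmed to be the patent's Equation 1, and the function name `common_kmer_fraction` is hypothetical.

```python
from collections import Counter


def common_kmer_fraction(freq_x: Counter, freq_y: Counter,
                         len_x: int, len_y: int, k: int) -> float:
    """Assumed measure: shared k-mer count over the k-mer count of the shorter sequence."""
    denominator = min(len_x, len_y) - k + 1
    if denominator <= 0:
        raise ValueError("k must not exceed the length of the shorter sequence")
    # For each number, only the occurrences present in both sequences are shared.
    shared = sum(min(count, freq_y.get(kmer_id, 0))
                 for kmer_id, count in freq_x.items())
    return shared / denominator


# Frequencies and lengths from the FIG. 3 example (l_X = 11, l_Y = 9, k = 7).
freq_x = Counter([1, 2, 3, 4, 5])
freq_y = Counter([3, 4, 5])
similarity = common_kmer_fraction(freq_x, freq_y, 11, 9, 7)
```

For this example the three shared numbers give 3 / (9 − 7 + 1), so `similarity` evaluates to 1.0 under the assumed formula.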

Although the sequence input unit 100, the sequence total length calculation unit 200, the k-mer generation unit 300, the k-mer/number substitution unit 400, the k-mer frequency counting unit 500, and the sequence similarity calculation unit 600 have been described as individual components, they may be implemented together in a PC (Personal Computer), a notebook, a smartphone, a netbook, or the like.

Hereinafter, a k-mer conversion method for measuring the similarity of sequences, implemented by the k-mer conversion apparatus according to an embodiment of the present invention configured as described above, will be described.

FIG. 2 is a flowchart showing a k-mer conversion method for measuring the similarity of sequences, implemented by the k-mer conversion apparatus according to an embodiment of the present invention.

First, the sequence input unit 100 inputs two sequences X and Y to the sequence total length calculation unit 200 (S1).

Then, the sequence total length calculation unit 200 calculates the total lengths (l_X, l_Y) of the sequences X and Y input in step S1 (S2).

In step S3, the k-mer generation unit 300 divides each of the sequences X and Y inputted in step S1 according to the set k to generate k-mers.

In step S4, the k-mer / number replacement unit 400 replaces the k-mers for each of the sequences X and Y generated in step S3 with numbers using a hash map.
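
Step S4 can be sketched as follows. The source says only that a hash map is used; sharing one map between both sequences and assigning numbers from 1 in order of first appearance are assumptions chosen to match the numbering in the FIG. 3 example. The function name `assign_ids` is hypothetical.

```python
def assign_ids(kmers: list[str], table: dict[str, int]) -> list[int]:
    """Replace each k-mer with its number, adding unseen k-mers to the shared table."""
    ids = []
    for kmer in kmers:
        if kmer not in table:
            # Assumption: the next unused number is table size + 1.
            table[kmer] = len(table) + 1
        ids.append(table[kmer])
    return ids


# k-mers of the FIG. 3 example (k = 7).
table: dict[str, int] = {}
ids_x = assign_ids(["AATAATA", "ATAATAC", "TAATACT", "AATACTA", "ATACTAA"], table)
ids_y = assign_ids(["TAATACT", "AATACTA", "ATACTAA"], table)
# ids_x is [1, 2, 3, 4, 5] and ids_y is [3, 4, 5]
```

Using a single table for both sequences is what makes the numbers comparable: the same k-mer receives the same number in X and in Y.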

FIG. 3 is a diagram for explaining a step of generating a k-mer according to the set k in FIG. 2 and replacing the generated k-mer with a number using a hash map.

Referring to FIG. 3, k is set to 7, "AATAATACTAA" represents sequence X, and "AATAATA", "ATAATAC", "TAATACT", "AATACTA" and "ATACTAA" represent the k-mers generated by dividing sequence X. "AATAATA", "ATAATAC", "TAATACT", "AATACTA" and "ATACTAA" are replaced by "1", "2", "3", "4" and "5", respectively, using a hash map.

"TAATACTAA" represents sequence Y, and "TAATACT", "AATACTA", and "ATACTAA" represent k-mers produced by dividing sequence Y. "TAATACT", "AATACTA", and "ATACTAA" are replaced by "3", "4", and "5" respectively by a hash map.

In step S5, the k-mer frequency counting unit 500 counts the frequency of k-mers whose numbers have been replaced in step S4.

In step S6, the sequence similarity calculation unit 600 calculates the sequence similarity (k-mer distance) with Equation 1 above, using the total lengths (l_X, l_Y) of the sequences X and Y calculated in step S2 and the k-mer frequencies [n_X(τ), n_Y(τ)] counted in step S5.
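
Steps S1 to S6 can be put together in one short sketch. Several points are assumptions rather than details from the source: the overlapping window of step 1 in S3, the shared dictionary numbered from 1 in S4, and in S6 the fraction of common k-mers, sum of min(n_X(τ), n_Y(τ)) divided by min(l_X, l_Y) − k + 1, used in place of Equation 1, whose image is not reproduced in this text. The function name `kmer_similarity` is hypothetical.

```python
from collections import Counter


def kmer_similarity(seq_x: str, seq_y: str, k: int) -> float:
    """Sketch of S1-S6; the S6 formula is an assumed stand-in for Equation 1."""
    # S1/S2: take the two sequences and compute their total lengths.
    len_x, len_y = len(seq_x), len(seq_y)
    if k <= 0 or min(len_x, len_y) < k:
        raise ValueError("k must be positive and no longer than the shorter sequence")

    # S3: generate overlapping k-mers for each sequence.
    kmers_x = [seq_x[i:i + k] for i in range(len_x - k + 1)]
    kmers_y = [seq_y[i:i + k] for i in range(len_y - k + 1)]

    # S4: replace k-mers with numbers through one shared hash map.
    table: dict[str, int] = {}
    ids_x = [table.setdefault(kmer, len(table) + 1) for kmer in kmers_x]
    ids_y = [table.setdefault(kmer, len(table) + 1) for kmer in kmers_y]

    # S5: count frequencies of the numbers, not of the k-mer strings.
    freq_x, freq_y = Counter(ids_x), Counter(ids_y)

    # S6 (assumed formula): shared occurrences over k-mers in the shorter sequence.
    shared = sum(min(count, freq_y[kmer_id]) for kmer_id, count in freq_x.items())
    return shared / (min(len_x, len_y) - k + 1)


# Sequences and k from the FIG. 3 example.
result = kmer_similarity("AATAATACTAA", "TAATACTAA", 7)
```

For the FIG. 3 sequences the three k-mers of Y all occur in X, so `result` is 1.0 under the assumed formula.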

FIG. 4 compares the memory usage and running time of an embodiment of the present invention with those of the prior art. As shown in FIG. 4, compared with the prior art (which does not use the k-mer substitution method), the k-mer conversion method according to the embodiment of the present invention more than doubled the processing speed and improved memory efficiency by 30% to 60%.

As described above, in the k-mer conversion apparatus and method for measuring the similarity of sequences according to an embodiment of the present invention, the sequence total length calculation unit calculates the total length of each of the sequences X and Y input from the sequence input unit; the k-mer generation unit divides each of the sequences X and Y according to the set k to generate k-mers; the k-mer/number substitution unit replaces the generated k-mers with numbers using a hash map; the k-mer frequency counting unit counts the frequency of the k-mers that were replaced with numbers; and the sequence similarity calculation unit calculates the sequence similarity (k-mer distance) from the total lengths and the counted frequencies. Because the k-mers of each sequence are represented by numbers and only the occurrences of those numbers are counted, the method does not occupy a large amount of memory and processes quickly, and is therefore suitable for measuring the similarity of large-volume sequence data.

Although the best mode has been shown and described in the drawings and specification, the specific terminology used is for the purpose of describing the embodiments of the invention only and is not intended to limit the scope of the invention described in the claims. Therefore, those skilled in the art will understand that various changes and modifications may be made without departing from the scope of the present invention. Accordingly, the true scope of the present invention should be determined by the technical idea of the appended claims.

100: sequence input unit 200: sequence total length calculation unit
300: k-mer generation unit 400: k-mer/number substitution unit
500: k-mer frequency counting unit 600: sequence similarity calculating unit

Claims (3)

A k-mer conversion apparatus for measuring the similarity of sequences, comprising:
a sequence input unit configured to input two sequences X and Y;
a sequence total length calculation unit configured to calculate the total length of each of the sequences X and Y input from the sequence input unit;
a k-mer generation unit configured to generate k-mers by dividing each of the sequences X and Y input from the sequence input unit according to a set value k;
a k-mer/number substitution unit configured to replace the k-mers generated by the k-mer generation unit for each of the sequences X and Y with numbers using a hash map;
a k-mer frequency counting unit configured to count the frequency of the k-mers that were replaced with numbers in the k-mer/number substitution unit; and
a sequence similarity calculation unit configured to calculate the sequence similarity (k-mer distance) using the total length of each of the sequences X and Y calculated by the sequence total length calculation unit and the k-mer frequencies counted by the k-mer frequency counting unit.
The apparatus according to claim 1,
wherein the sequence similarity calculation unit is configured to calculate the sequence similarity using Equation 1 below:

[Equation 1: Figure pat00003]

where X and Y denote the sequences, d_{X,Y} denotes the k-mer distance between sequences X and Y, τ denotes a k-mer, n_X(τ) and n_Y(τ) denote the frequency of that k-mer in sequences X and Y, respectively, l_X and l_Y denote the total lengths of sequences X and Y, respectively, and k is a constant denoting the length of the k-mer.
A k-mer conversion method for measuring the similarity of sequences, implemented by the k-mer conversion apparatus for measuring the similarity of sequences as set forth in claim 1 or 2, the method comprising:
inputting two sequences X and Y from the sequence input unit;
calculating, by the sequence total length calculation unit, the total length of each of the sequences X and Y input in the inputting step;
generating, by the k-mer generation unit, k-mers by dividing each of the sequences X and Y input in the inputting step according to the set k;
replacing, by the k-mer/number substitution unit, the k-mers generated for each of the sequences X and Y with numbers using a hash map;
counting, by the k-mer frequency counting unit, the frequency of the k-mers that were replaced with numbers in the replacing step; and
calculating, by the sequence similarity calculation unit, the sequence similarity (k-mer distance) using the total length of each of the sequences X and Y calculated in the calculating step and the k-mer frequencies counted in the counting step.

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020150183675A KR20170074418A (en) 2015-12-22 2015-12-22 Apparatus and method for converting k-mer for measuring similarity of sequences

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020150183675A KR20170074418A (en) 2015-12-22 2015-12-22 Apparatus and method for converting k-mer for measuring similarity of sequences

Publications (1)

Publication Number Publication Date
KR20170074418A 2017-06-30

Family

ID=59279669

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020150183675A KR20170074418A (en) 2015-12-22 2015-12-22 Apparatus and method for converting k-mer for measuring similarity of sequences

Country Status (1)

Country Link
KR (1) KR20170074418A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111028897A (en) * 2019-12-13 2020-04-17 内蒙古农业大学 Hadoop-based distributed parallel computing method for genome index construction
KR20240006339A (en) 2022-07-06 2024-01-15 주식회사 코아아이티 Device and method of selecting str marker candidates

Similar Documents

Publication Publication Date Title
CN110352389B (en) Information processing apparatus and information processing method
US10192028B2 (en) Data analysis device and method therefor
JP5493597B2 (en) Search method and search system
US11288580B2 (en) Optimal solution search method, optimal solution search program, and optimal solution search apparatus
KR20130107889A (en) Aparatus and method for detecting anomalous subsequence
JP6200076B2 (en) Method and system for evaluating measurements obtained from a system
KR20170074418A (en) Apparatus and method for converting k-mer for measuring similarity of sequences
US20150142328A1 (en) Calculation method for interchromosomal translocation position
Sogabe et al. An acceleration method of short read mapping using FPGA
JP2019133305A (en) Chaos gage correction device and program for chaos gage correction
Deorowicz et al. Kalign-LCS—a more accurate and faster variant of Kalign2 algorithm for the multiple sequence alignment problem
He et al. Inference of RNA structural contacts by direct coupling analysis
KR101584857B1 (en) System and method for aligning genome sequnce
CN111936636B (en) Determination of the frequency distribution of nucleotide sequence variants
CN113435599A (en) Information processing apparatus, specifying method, and non-transitory computer-readable storage medium
CN112735596A (en) Similar patient determination method and device, electronic equipment and storage medium
US20200105374A1 (en) Mixture model for targeted sequencing
JP6789253B2 (en) Search device, search method, and program
US20190095483A1 (en) Search apparatus, storage medium, database system, and search method
JP6841039B2 (en) Factor analyzer, factor analysis method, and program
Aldawiri et al. A Novel Approach for Mapping Ambiguous Sequences of Transcriptomes
GUDODAGI et al. Customized Computational Environment for Investigations and Compression of Genomic Data.
Munjal et al. Sequence similarity using composition method
CN108090604A (en) Based on the improved GM of trapezoid formula(1,1)Model prediction method
Weitschek et al. Classifying bacterial genomes with compact logic formulas on k-Mer frequencies

Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
E601 Decision to refuse application