KR20170074418A - Apparatus and method for converting k-mer for measuring similarity of sequences - Google Patents
- Publication number
- KR20170074418A (application KR1020150183675A)
- Authority
- KR
- South Korea
- Prior art keywords
- mer
- sequence
- sequences
- unit
- similarity
Classifications
-
- G06F19/22—
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6869—Methods for sequencing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
-
- G06F19/24—
Abstract
The present invention relates to a k-mer conversion apparatus and method for measuring the similarity of sequences, which make it possible to measure the similarity of large-scale sequences quickly. The k-mer conversion apparatus according to the present invention comprises: a sequence input unit configured to input two sequences X and Y; a sequence total length calculation unit configured to calculate the total length of each of the sequences X and Y input from the sequence input unit; a k-mer generation unit configured to generate k-mers by dividing each of the sequences X and Y input from the sequence input unit according to a set k; a k-mer/number replacement unit configured to replace the k-mers generated by the k-mer generation unit for each of the sequences X and Y with numbers using a hash map; a k-mer frequency counting unit configured to count the frequency of the k-mers replaced with numbers by the k-mer/number replacement unit; and a sequence similarity calculation unit configured to calculate the similarity (k-mer distance) of the sequences using the total length of each of the sequences X and Y calculated by the sequence total length calculation unit and the frequency of the k-mers counted by the k-mer frequency counting unit.
Description
The present invention relates to a k-mer conversion apparatus and method for measuring the similarity of sequences and, more particularly, to a k-mer conversion apparatus and method that enable rapid measurement of the similarity of large-scale sequences.
In general, determining the similarity between sequences is essential for finding a homologous protein or a common ancestral gene between two given sequences. Among the methods of comparing sequence similarity, the most accurate one currently known is global alignment using the NW (Needleman-Wunsch) algorithm. However, because this method is slow, it is difficult to apply to large numbers of sequences.
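To illustrate why global alignment is costly, the following is a minimal sketch of the Needleman-Wunsch score computation. The function name `nw_score` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration and do not come from the patent. The nested loops fill one cell per pair of positions, so the work grows with the product of the two sequence lengths.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of strings a and b.

    Scoring values are illustrative assumptions, not taken from the patent.
    Only two rows of the dynamic-programming table are kept in memory.
    """
    # Row 0: aligning the empty prefix of a against prefixes of b costs gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against the empty prefix of b.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in b
            left = curr[j - 1] + gap  # gap inserted in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(nw_score("GATTACA", "GATTACA"))
```

Every one of the len(a) x len(b) cells is visited once, which is the quadratic cost that motivates faster alignment-free measures such as the k-mer distance.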
To address this problem, several sequence similarity measurement algorithms have been introduced. Among them, the technique based on k-mer distance calculation is less accurate than the NW algorithm; however, it is widely used for similarity comparison.
The conventional sequence similarity measuring method, disclosed for example in Korean Patent Registration No. 1479735, comprises: a first step in which two sequence data items, out of a plurality of sequence aggregation data, and externally set reference value data are input from a database and sorted by a processor into sequence pairs; a second step in which a processing module of the processor, to which a k-mer algorithm is applied that calculates the k-mer distance from the k-mer profiles of the two sequence data, the sequence lengths, and the value of k, extracts a first result value by calculating the k-mer distance of the sequence pair; a third step in which a determination module of the processor extracts, from among the first result values, a second result value that satisfies the reference value data; and a fourth step of marshaling the result value of the third step in the processing unit and performing parallel processing.
However, the conventional method computes the k-mers of every sequence in advance and stores them in memory as they are, so it requires a large memory capacity and processes slowly.
In addition, recent NGS (Next Generation Sequencing) techniques produce far more nucleotide sequences than conventional techniques. As the need to measure similarity among such large quantities of sequences grows, so does the need for rapid sequence similarity measurement.
SUMMARY OF THE INVENTION
The present invention has been made in view of the above problems, and it is an object of the present invention to provide a k-mer conversion apparatus and method for measuring the similarity of sequences that increase processing speed by not occupying a large memory capacity.
According to an aspect of the present invention, there is provided a k-mer conversion apparatus for measuring the similarity of sequences, comprising: a sequence input unit configured to input two sequences X and Y; a sequence total length calculation unit configured to calculate the total length of each of the sequences X and Y input from the sequence input unit; a k-mer generation unit configured to generate k-mers by dividing each of the sequences X and Y input from the sequence input unit according to a set k; a k-mer/number replacement unit configured to replace the k-mers generated by the k-mer generation unit for each of the sequences X and Y with numbers using a hash map; a k-mer frequency counting unit configured to count the frequency of the k-mers replaced with numbers by the k-mer/number replacement unit; and a sequence similarity calculation unit configured to calculate the similarity (k-mer distance) of the sequences using the total length of each of the sequences X and Y calculated by the sequence total length calculation unit and the frequency of the k-mers counted by the k-mer frequency counting unit.
In the k-mer conversion apparatus for measuring the similarity of sequences according to the above embodiment, the sequence similarity calculation unit may calculate the sequence similarity using the following Equation 1:
[Equation 1]
where X and Y denote the sequences, d_{X,Y} denotes the k-mer distance between sequences X and Y, τ denotes a k-mer, n_X(τ) and n_Y(τ) denote the frequency of the corresponding k-mer in sequences X and Y, respectively, l_X and l_Y denote the total lengths of sequences X and Y, respectively, and k is a constant denoting the length of the k-mer.
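The body of Equation 1 does not survive in this text. The symbols it defines (per-k-mer frequencies in X and Y, the two total lengths, and k) match a commonly used k-mer measure in which the shared k-mer count is normalized by the number of k-mers in the shorter sequence. Assuming that form, which is a reconstruction and not confirmed by the source, the equation would read:

```latex
d_{X,Y} = \frac{\sum_{\tau} \min\bigl(n_X(\tau),\, n_Y(\tau)\bigr)}{\min(l_X,\, l_Y) - k + 1}
```

Under this assumed form, the denominator is the number of k-mers in the shorter sequence, so the value lies between 0 and 1 and grows as the two sequences share more k-mers.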
According to another aspect of the present invention, there is provided a k-mer conversion method for measuring the similarity of sequences, comprising: inputting two sequences X and Y by a sequence input unit; calculating, by a sequence total length calculation unit, the total length of each of the sequences X and Y input in the inputting step; generating, by a k-mer generation unit, k-mers by dividing each of the sequences X and Y input in the inputting step according to a set k; replacing, by a k-mer/number replacement unit, the k-mers generated for each of the sequences X and Y with numbers using a hash map; counting, by a k-mer frequency counting unit, the frequency of the k-mers replaced with numbers in the replacing step; and calculating the similarity (k-mer distance) of the sequences using the total length of each of the sequences X and Y calculated in the calculating step and the frequency of the k-mers counted in the counting step.
According to the k-mer conversion apparatus and method for measuring the similarity of sequences according to the embodiment of the present invention, the sequence total length calculation unit calculates the total length of each of the sequences X and Y input from the sequence input unit; the k-mer generation unit generates k-mers by dividing each of the sequences X and Y according to the set k; the k-mer/number replacement unit replaces the generated k-mers with numbers using a hash map; the k-mer frequency counting unit counts the frequency of the k-mers replaced with numbers; and the sequence similarity calculation unit calculates the similarity (k-mer distance) of the sequences using the calculated total lengths and the counted frequencies. That is, because the k-mers of each of the sequences X and Y are replaced with numbers, their frequencies are counted, and the similarity is calculated from those counts, the apparatus does not occupy a large amount of memory and processes quickly, which makes it well suited to measuring the similarity of large-scale sequences.
FIG. 1 is a control block diagram of a k-mer conversion apparatus for measuring the similarity of sequences according to an embodiment of the present invention.
FIG. 2 is a flowchart showing a k-mer conversion method for measuring the degree of similarity of a sequence, which is implemented by a k-mer conversion apparatus for measuring the similarity of sequences according to an embodiment of the present invention.
FIG. 3 is a diagram for explaining a step of generating a k-mer according to the set k in FIG. 2 and replacing the generated k-mer with a number using a hash map.
FIG. 4 compares the memory usage and running time of an embodiment of the present invention with those of the prior art.
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
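FIG. 1 is a control block diagram of a k-mer conversion apparatus for measuring the similarity of sequences according to an embodiment of the present invention.
Referring to FIG. 1, a k-mer conversion apparatus for measuring the similarity of sequences according to an embodiment of the present invention includes a sequence input unit 100, a sequence total length calculation unit 200, a k-mer generation unit 300, a k-mer/number replacement unit 400, a k-mer frequency counting unit 500, and a sequence similarity calculation unit 600.
The sequence input unit 100 inputs two sequences X and Y.
The sequence total length calculation unit 200 calculates the total length of each of the sequences X and Y input from the sequence input unit 100.
The k-mer generation unit 300 generates k-mers by dividing each of the sequences X and Y input from the sequence input unit 100 according to the set k.
The k-mer/number replacement unit 400 replaces the k-mers generated by the k-mer generation unit 300 for each of the sequences X and Y with numbers using a hash map.
The k-mer frequency counting unit 500 counts the frequency of the k-mers replaced with numbers by the k-mer/number replacement unit 400.
The sequence similarity calculation unit 600 calculates the similarity (k-mer distance) of the sequences using the total length of each of the sequences X and Y calculated by the sequence total length calculation unit 200 and the frequency of the k-mers counted by the k-mer frequency counting unit 500, according to Equation 1, where X and Y denote the sequences, d_{X,Y} denotes the k-mer distance between sequences X and Y, τ denotes a k-mer, n_X(τ) and n_Y(τ) denote the frequency of the corresponding k-mer in sequences X and Y, respectively, l_X and l_Y denote the total lengths of sequences X and Y, respectively, and k is a constant denoting the length of the k-mer.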
The
Hereinafter, a k-mer conversion method for measuring the degree of similarity of a sequence, which is implemented in an apparatus for measuring similarity of sequences according to an embodiment of the present invention configured as described above, will be described.
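FIG. 2 is a flowchart showing a k-mer conversion method for measuring the similarity of sequences, which is implemented by a k-mer conversion apparatus for measuring the similarity of sequences according to an embodiment of the present invention.
First, the sequence input unit 100 inputs two sequences X and Y (S1).
Then, the sequence total length calculation unit 200 calculates the total length (l_X, l_Y) of each of the sequences X and Y input in step S1 (S2).
In step S3, the k-mer generation unit 300 generates k-mers by dividing each of the sequences X and Y input in step S1 according to the set k.
In step S4, the k-mer/number replacement unit 400 replaces the k-mers generated in step S3 for each of the sequences X and Y with numbers using a hash map.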
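Steps S3 and S4 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `kmers` and `replace_with_numbers` are hypothetical, and a Python `dict` stands in for the hash map. The two input strings and k = 7 are the example values of FIG. 3, and the numbering from 1 in order of first appearance is chosen to match that figure.

```python
def kmers(seq, k):
    """Return every overlapping k-mer of seq, from left to right (step S3)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def replace_with_numbers(kmer_list, hash_map):
    """Replace each k-mer with an integer ID from hash_map (step S4).

    A k-mer seen for the first time gets the next unused ID, starting at 1.
    """
    ids = []
    for kmer in kmer_list:
        if kmer not in hash_map:
            hash_map[kmer] = len(hash_map) + 1
        ids.append(hash_map[kmer])
    return ids


hash_map = {}  # shared by both sequences so equal k-mers get equal numbers
x_ids = replace_with_numbers(kmers("AATAATACTAA", 7), hash_map)
y_ids = replace_with_numbers(kmers("TAATACTAA", 7), hash_map)
print(x_ids)  # [1, 2, 3, 4, 5]
print(y_ids)  # [3, 4, 5]
```

Sharing one map across both sequences is what lets later steps compare small integers instead of k-character strings.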
FIG. 3 is a diagram for explaining a step of generating a k-mer according to the set k in FIG. 2 and replacing the generated k-mer with a number using a hash map.
Referring to FIG. 3, k is set to 7, "AATAATACTAA" represents sequence X, and "AATAATA", "ATAATAC", "TAATACT", "AATACTA" and "ATACTAA" represent the k-mers generated by dividing sequence X. "AATAATA", "ATAATAC", "TAATACT", "AATACTA" and "ATACTAA" are replaced by "1", "2", "3", "4" and "5" respectively by a hash map.
"TAATACTAA" represents sequence Y, and "TAATACT", "AATACTA", and "ATACTAA" represent k-mers produced by dividing sequence Y. "TAATACT", "AATACTA", and "ATACTAA" are replaced by "3", "4", and "5" respectively by a hash map.
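In step S5, the k-mer frequency counting unit 500 counts the frequency of the k-mers replaced with numbers in step S4.
In step S6, the sequence similarity calculation unit 600 calculates the similarity (k-mer distance) of the sequences using the total length of each of the sequences X and Y calculated in step S2 and the frequency of the k-mers counted in step S5, according to Equation 1.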
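Steps S5 and S6 can be sketched as below, counting frequencies over the integer IDs rather than over k-mer strings. The body of Equation 1 is not reproduced in this text, so the formula used here (shared k-mer count divided by the number of k-mers in the shorter sequence) is an assumption based on the symbols the patent defines. The names `kmer_ids` and `kmer_distance` are hypothetical.

```python
from collections import Counter


def kmer_ids(seq, k, hash_map):
    """Split seq into overlapping k-mers and map each to an integer ID."""
    ids = []
    for i in range(len(seq) - k + 1):
        # setdefault returns the existing ID, or stores and returns a new one.
        ids.append(hash_map.setdefault(seq[i:i + k], len(hash_map) + 1))
    return ids


def kmer_distance(x, y, k):
    """Assumed form of Equation 1: sum of min counts over shared k-mers,
    divided by (min(l_X, l_Y) - k + 1)."""
    denom = min(len(x), len(y)) - k + 1
    if denom <= 0:
        raise ValueError("k must not exceed the length of the shorter sequence")
    hash_map = {}
    n_x = Counter(kmer_ids(x, k, hash_map))  # step S5 for X
    n_y = Counter(kmer_ids(y, k, hash_map))  # step S5 for Y
    # A Counter returns 0 for IDs that never occurred in Y.
    shared = sum(min(count, n_y[t]) for t, count in n_x.items())
    return shared / denom  # step S6


print(kmer_distance("AATAATACTAA", "TAATACTAA", 7))
```

For the FIG. 3 example, all three 7-mers of sequence Y also occur in sequence X, so under the assumed formula the result is 3 / (9 - 7 + 1) = 1.0.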
FIG. 4 compares the memory usage and running time of an embodiment of the present invention with those of the prior art. As shown in FIG. 4, compared with the prior art (which does not use the k-mer substitution method), the k-mer conversion method for measuring the similarity of sequences according to the embodiment of the present invention more than doubled the processing speed and improved memory efficiency by 30% to 60%.
According to the k-mer conversion apparatus and method for measuring the similarity of sequences according to an embodiment of the present invention, the sequence total length calculation unit calculates the total length of each of the sequences X and Y input from the sequence input unit; the k-mer generation unit generates k-mers by dividing each of the sequences X and Y according to the set k; the k-mer/number replacement unit replaces the generated k-mers with numbers using a hash map; the k-mer frequency counting unit counts the frequency of the k-mers replaced with numbers; and the sequence similarity calculation unit calculates the similarity (k-mer distance) of the sequences using the calculated total lengths and the counted frequencies. That is, because the k-mers of each of the sequences X and Y are replaced with numbers, their frequencies are counted, and the similarity is calculated from those counts, the apparatus does not occupy a large amount of memory and processes quickly, and is therefore suitable for measuring the similarity of large-scale sequences.
Although the best mode has been shown and described in the drawings and specification, the specific terminology used is for the purpose of describing the embodiments of the invention only and is not intended to limit the scope of the invention described in the claims. Therefore, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the scope of the present invention. Accordingly, the true scope of the present invention should be determined by the technical idea of the appended claims.
100: sequence input unit 200: sequence total length calculation unit
300: k-mer generation unit 400: k-mer/number replacement unit
500: k-mer frequency counting unit 600: sequence similarity calculating unit
Claims (3)
1. A k-mer conversion apparatus for measuring the similarity of sequences, comprising:
a sequence input unit configured to input two sequences X and Y;
a sequence total length calculation unit configured to calculate the total length of each of the sequences X and Y input from the sequence input unit;
a k-mer generation unit configured to generate k-mers by dividing each of the sequences X and Y input from the sequence input unit according to a set k;
a k-mer/number replacement unit configured to replace the k-mers generated by the k-mer generation unit for each of the sequences X and Y with numbers using a hash map;
a k-mer frequency counting unit configured to count the frequency of the k-mers replaced with numbers by the k-mer/number replacement unit; and
a sequence similarity calculation unit configured to calculate the similarity (k-mer distance) of the sequences using the total length of each of the sequences X and Y calculated by the sequence total length calculation unit and the frequency of the k-mers counted by the k-mer frequency counting unit.
2. The k-mer conversion apparatus according to claim 1, wherein the sequence similarity calculation unit is configured to calculate the sequence similarity using the following Equation 1:
[Equation 1]
where X and Y denote the sequences, d_{X,Y} denotes the k-mer distance between sequences X and Y, τ denotes a k-mer, n_X(τ) and n_Y(τ) denote the frequency of the corresponding k-mer in sequences X and Y, respectively, l_X and l_Y denote the total lengths of sequences X and Y, respectively, and k is a constant denoting the length of the k-mer.
3. A k-mer conversion method for measuring the similarity of sequences, comprising:
inputting two sequences X and Y by a sequence input unit;
calculating, by a sequence total length calculation unit, the total length of each of the sequences X and Y input in the inputting step;
generating, by a k-mer generation unit, k-mers by dividing each of the sequences X and Y input in the inputting step according to a set k;
replacing, by a k-mer/number replacement unit, the k-mers generated for each of the sequences X and Y with numbers using a hash map;
counting, by a k-mer frequency counting unit, the frequency of the k-mers replaced with numbers in the replacing step; and
calculating the similarity (k-mer distance) of the sequences using the total length of each of the sequences X and Y calculated in the calculating step and the frequency of the k-mers counted in the counting step.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020150183675A KR20170074418A (en) | 2015-12-22 | 2015-12-22 | Apparatus and method for converting k-mer for measuring similarity of sequences |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020150183675A KR20170074418A (en) | 2015-12-22 | 2015-12-22 | Apparatus and method for converting k-mer for measuring similarity of sequences |
Publications (1)
Publication Number | Publication Date |
---|---|
KR20170074418A (en) | 2017-06-30 |
Family
ID=59279669
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020150183675A KR20170074418A (en) | 2015-12-22 | 2015-12-22 | Apparatus and method for converting k-mer for measuring similarity of sequences |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR20170074418A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111028897A (en) * | 2019-12-13 | 2020-04-17 | 内蒙古农业大学 | Hadoop-based distributed parallel computing method for genome index construction |
KR20240006339A (en) | 2022-07-06 | 2024-01-15 | 주식회사 코아아이티 | Device and method of selecting str marker candidates |
- 2015-12-22: KR application KR1020150183675A, published as KR20170074418A (not active; application discontinued)
Legal Events
Date | Code | Title | Description |
---|---|---|---|
A201 | Request for examination | ||
E902 | Notification of reason for refusal | ||
E601 | Decision to refuse application |