KR20140140677A

KR20140140677A - Method for extracting longest common sub-sequence in sequence without appearance of duplicated token

Info

Publication number: KR20140140677A
Application number: KR1020130061174A
Authority: KR
Inventors: 이여름; 임을규; 강부중
Original assignee: 한양대학교 산학협력단
Priority date: 2013-05-29
Filing date: 2013-05-29
Publication date: 2014-12-10

Abstract

A method for extracting the longest common subsequence (LCS) of two sequences is applied to extracting the portions shared by large volumes of data, as in protein sequence analysis. Because a token, the smallest unit that makes up a sequence, generally appears several times in a sequence, an LCS extraction method must track every possible pair of tokens. When no token appears more than once in a sequence, however, the LCS can be extracted faster than with the existing method. The present invention extracts the LCS of sequences without duplicated tokens faster than the existing method.

Description

METHOD FOR EXTRACTING LONGEST COMMON SUB-SEQUENCE IN SEQUENCE WITHOUT APPEARANCE OF DUPLICATED TOKEN

The present invention relates to a method for extracting the longest common subsequence from sequences in which no token is duplicated.

Conventionally, the longest common subsequence has been extracted by comparing all possible pairs of tokens. In DNA analysis in particular, the longest common subsequence method is used to find the nucleotide sequence common to proteins.

The existing longest common subsequence extraction algorithm incurs a large time and memory cost because, to find every position at which a given token appears, it must compare that token against every position of the sequence. In a sequence with no duplicated tokens, however, a given token appears at only one position. The time spent comparing all possible positions can therefore be reduced, and because a memory location that stores a comparison result is not updated again later in the process, less memory is needed and memory is used efficiently.

A method of updating an array to find the longest common subsequence of two sequences.

A method of finding the longest common subsequence from the updated array.

For sequences in which no token is duplicated, the longest common subsequence can be extracted faster and more efficiently than with the conventional algorithm, which increases the speed and memory utilization efficiency of the system.

FIG. 1 shows a conventional method of extracting the longest common subsequence.
FIG. 2 shows the method of extracting the longest common subsequence of the present invention.
FIG. 3 shows a flowchart of the method of extracting the longest common subsequence of the present invention.

In the following, embodiments will be described in detail with reference to the accompanying drawings. Like reference symbols in the drawings denote like elements.

While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the invention is not intended to be limited to the particular embodiments, but includes all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

The terminology used in this application is used only to describe a specific embodiment and is not intended to limit the invention. Singular expressions include plural expressions unless the context clearly dictates otherwise. In the present application, terms such as "comprises" or "having" specify the presence of a feature, number, step, operation, element, component, or combination thereof described in the specification, but do not preclude the presence or addition of one or more other features, numbers, steps, operations, elements, components, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Terms such as those defined in commonly used dictionaries are to be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in the present application.

In the following description of the present invention with reference to the accompanying drawings, the same components are denoted by the same reference numerals regardless of the figure in which they appear, and redundant explanations thereof are omitted. Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the following description, well-known functions or constructions are not described in detail since they would obscure the invention in unnecessary detail.

The program similarity analysis tool extracts the frequency of occurrence of each instruction in a program and generates an instruction sequence in which the instructions are arranged according to that frequency. To compare this sequence with the instruction sequence of another program, the tool extracts the longest common subsequence of the two instruction sequences. The invention thus relates to a system that analyzes the similarity of programs using the length of that subsequence.
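As a rough illustration of how such an instruction sequence might be built, the following Python sketch orders the distinct instructions of a program by how often they occur. The function name `frequency_ordered_sequence`, the instruction mnemonics, the descending order, and the alphabetical tie-breaking rule are all assumptions made for this example; the source does not specify them.

```python
from collections import Counter


def frequency_ordered_sequence(instructions):
    """Return each distinct instruction once, most frequent first.

    Assumption: ties are broken alphabetically so the result is
    deterministic; the source does not state a tie-breaking rule.
    """
    counts = Counter(instructions)
    return sorted(counts, key=lambda op: (-counts[op], op))


# Hypothetical instruction trace of a small program.
trace = ["mov", "add", "mov", "jmp", "add", "mov"]
print(frequency_ordered_sequence(trace))  # ['mov', 'add', 'jmp']
```

Because each instruction appears exactly once in the result, a sequence built this way contains no duplicated tokens, which is the special case the method targets.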

Applicable products and methods include malicious code detection tools and program plagiarism detection tools.

Future application areas are malicious code detection and program plagiarism detection.

In general, a token can appear more than once in a sequence, so all possible pairs of tokens must be compared, as in the conventional method. In the special case where no token is duplicated, however, it is not necessary to compare all possible pairs.

FIG. 1 shows a conventional method of extracting the longest common subsequence.

The longest common subsequence extraction algorithm is a well-known dynamic programming algorithm. Because every pair of tokens must be compared, its running time is known to be proportional to the product of the lengths of the two sequences. As shown in FIG. 1, to extract the longest common subsequence of the two strings "BDCAE" and "ABCEDF", a matrix whose size corresponds to the product of the lengths of the two strings must be filled. Each cell of the matrix represents a pair of tokens taken from the two strings. The basic idea of the algorithm is to find the longest common subsequence by filling each cell of the matrix with the result of comparing the corresponding pair of tokens. Since a comparison must be performed for every pair of tokens, the algorithm takes time and memory proportional to the product of the two sequence lengths.
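For reference, the conventional approach can be sketched in Python as the textbook dynamic programming table. This is a generic illustration rather than a transcription of FIG. 1; the function name `lcs_length_classic` is chosen for this example only, and the table carries one extra row and column of zeros as a border.

```python
def lcs_length_classic(x, y):
    """Length of the longest common subsequence via the full DP table.

    Time and memory are both proportional to len(x) * len(y).
    """
    rows, cols = len(x) + 1, len(y) + 1
    # table[i][j] holds the LCS length of x[:i] and y[:j].
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if x[i - 1] == y[j - 1]:
                # Matching tokens extend the LCS of the two prefixes.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise carry over the better of the two neighbors.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


print(lcs_length_classic("BDCAE", "ABCEDF"))  # 3
```

Every cell is visited once, which is where the cost proportional to the product of the two lengths comes from.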

A comparison is performed for every pair of tokens because the algorithm does not know where a particular token appears in the string. And because tokens can be duplicated, the comparison must continue to the end of the target string. If no token is duplicated, however, a particular token needs to be found only once in the string, and the rest of the string need not be compared: because there is no duplication, the token will not appear again later in the string. In other words, the method only needs to find a specific token once in the string and update the value of the corresponding cell. No other cell is updated at that time, and the updated cell is not updated again in the subsequent process. The memory required is therefore not a matrix corresponding to the product of the sizes of the two strings, but only an array the size of one string. In addition, when a symbol table is used for the tokens, the step of searching the string for a token's position can be shortened to constant time. In this case, however, additional memory is used in proportion to the number of possible tokens.

FIG. 2 shows the method of extracting the longest common subsequence of the present invention.

The left side of the figure shows the extraction process as pseudocode. Using the string map, find the position in the array at which a particular token (x_i) of the string X appears in the string Y, find the largest value to the left of that position in the array, and store the new value for (x_i) in the array at the position given by the string Y. If this is done for every token (x_i) of X, the data is stored as in the last array in the right side of the figure. Then, by searching for the largest value starting from the rightmost position, the longest common subsequence can be extracted.
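The following Python sketch shows one way the array update and extraction described above could be realized. It is an interpretation under stated assumptions, not the pseudocode of FIG. 2: it assumes the value stored for a token is the largest value to its left plus one, and that extraction walks the array from right to left looking for successively smaller values. The names `lcs_without_duplicates`, `position_in_y`, and `lengths` are chosen for this example.

```python
def lcs_without_duplicates(x, y):
    """Longest common subsequence of two sequences with no repeated tokens."""
    if len(set(x)) != len(x) or len(set(y)) != len(y):
        raise ValueError("both sequences must be free of duplicated tokens")

    # Map initialization: token -> its single position in y.
    position_in_y = {token: index for index, token in enumerate(y)}

    # Array update: one cell per position of y, each written at most once.
    lengths = [0] * len(y)
    for token in x:
        pos = position_in_y.get(token)
        if pos is None:
            continue  # token of x does not occur in y
        # Assumption: stored value = largest value to the left, plus one.
        lengths[pos] = max(lengths[:pos], default=0) + 1

    # Extraction: scan from the right for the largest value, then for
    # each smaller value in turn.
    needed = max(lengths, default=0)
    result = []
    for index in range(len(y) - 1, -1, -1):
        if needed == 0:
            break
        if lengths[index] == needed:
            result.append(y[index])
            needed -= 1
    result.reverse()
    return result


print("".join(lcs_without_duplicates("BDCAE", "ABCEDF")))  # BCE
```

For the two example strings the array ends as `[1, 1, 2, 3, 2, 0]`. The sketch uses a dictionary for the constant-time position lookup and a single array the size of Y; the linear scan for the largest value to the left is kept for clarity and is not optimized.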

FIG. 3 shows a flowchart of the method of extracting the longest common subsequence of the present invention.

The method consists of an initialization process for the map, which is used to quickly determine the presence and position of a given token, and an array update process for extracting the longest common subsequence. Once the array update is complete, the longest common subsequence of the sequences without duplicated tokens can be extracted.

The method according to an embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded in a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be those specially designed and configured for the embodiments or may be known and available to those skilled in the art of computer software. Examples of computer-readable media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specifically configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine language code such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.

While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. For example, it is to be understood that the techniques described may be performed in a different order than the described methods, and/or that components of the described systems, structures, devices, circuits, and the like may be replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents to the claims are also within the scope of the following claims.

Claims

A method of updating an array to find the longest common subsequence of two sequences.

A method of finding the longest common subsequence from the updated array.