KR20140140677A - Method for extracting longest common sub-sequence in sequence without appearance of duplicated token - Google Patents

Info

Publication number
KR20140140677A
Authority
KR
South Korea
Prior art keywords
sequence
longest common
token
extracting
tokens
Prior art date
Application number
KR1020130061174A
Other languages
Korean (ko)
Inventor
이여름
임을규
강부중
Original Assignee
한양대학교 산학협력단
Priority date
Filing date
Publication date
Application filed by 한양대학교 산학협력단 filed Critical 한양대학교 산학협력단
Priority to KR1020130061174A priority Critical patent/KR20140140677A/en
Publication of KR20140140677A publication Critical patent/KR20140140677A/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models

Landscapes

  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Algebra (AREA)
  • Stored Programmes (AREA)

Abstract

A method for extracting the longest common subsequence (LCS) of two sequences is used to extract the partial data shared within bulk data, as in protein sequence analysis. Because a token, the smallest unit making up a sequence, generally appears several times in a sequence, an LCS extraction method has to trace every possible pair of tokens. If no token appears in a sequence more than once, however, the LCS can be extracted faster than with the existing method. The present invention extracts the LCS from sequences in which no token is duplicated faster than the existing method.

Description

METHOD FOR EXTRACTING LONGEST COMMON SUB-SEQUENCE IN SEQUENCE WITHOUT APPEARANCE OF DUPLICATED TOKEN

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a method for extracting the longest common subsequence from sequences in which no token is duplicated.

Conventionally, the longest common subsequence has been extracted by comparing all possible pairs of tokens. In DNA analysis in particular, the longest common subsequence method is used to find a common nucleotide sequence.

The existing longest common subsequence extraction algorithm has a large time and memory cost because, to find every position at which a specific token appears, it must compare that token against every position of the sequence. In a sequence in which no token is duplicated, however, a specific token appears at no more than one position. The time spent comparing all possible positions can therefore be reduced. In addition, a stored comparison result is never updated again later in the process, so less memory space is needed and memory is used efficiently.

A method of updating an array to find the longest common subsequence of two sequences.

A method of finding the longest common subsequence from the updated array.

In sequences in which no token is duplicated, the longest common subsequence can be extracted faster and more efficiently than with the conventional algorithm, which increases the speed and the memory utilization efficiency of the system.

FIG. 1 shows a conventional method of extracting the longest common subsequence.
FIG. 2 shows the method of extracting the longest common subsequence of the present invention.
FIG. 3 shows a flowchart of the method of extracting the longest common subsequence of the present invention.

In the following, embodiments will be described in detail with reference to the accompanying drawings. Like reference symbols in the drawings denote like elements.

While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the invention is not intended to be limited to the particular embodiments, but includes all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

The terminology used in this application is used only to describe specific embodiments and is not intended to limit the invention. Singular expressions include plural expressions unless the context clearly dictates otherwise. In the present application, terms such as "comprises" or "having" specify the presence of a feature, number, step, operation, element, component, or combination thereof described in the specification, but do not preclude the presence or addition of one or more other features, numbers, steps, operations, elements, components, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Terms such as those defined in commonly used dictionaries are to be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in the present application.

In the following description with reference to the accompanying drawings, the same components are denoted by the same reference numerals regardless of the figure in which they appear, and redundant explanations thereof are omitted.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the following description, well-known functions or constructions are not described in detail, since they would obscure the invention in unnecessary detail.

The program similarity analysis tool extracts the frequency of occurrence of each instruction from a program and generates an instruction sequence in which the instructions are arranged according to their frequency of appearance. To compare the program with another program, it extracts the longest common subsequence of the two instruction sequences and analyzes the similarity of the programs using the length of that subsequence. The present invention relates to such a system.
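The comparison described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the tie-breaking rule for instructions with equal frequency, and the normalization of the score by the longer sequence are hypothetical choices, not details given in this document.

```python
from collections import Counter


def frequency_ranked(instructions):
    """Distinct instructions ordered by descending frequency of appearance.

    Assumption: ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(instructions)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ordered]


def lcs_length_unique(x, y):
    """LCS length of two sequences that contain no duplicated tokens."""
    position = {token: j for j, token in enumerate(y)}
    best = [0] * len(y)
    for token in x:
        j = position.get(token)
        if j is not None:
            # Longest common subsequence ending at this token of y.
            best[j] = max(best[:j], default=0) + 1
    return max(best, default=0)


def similarity(program_a, program_b):
    """Assumed score: LCS length divided by the longer ranked sequence."""
    seq_a = frequency_ranked(program_a)
    seq_b = frequency_ranked(program_b)
    longest = max(len(seq_a), len(seq_b))
    if longest == 0:
        return 0.0
    return lcs_length_unique(seq_a, seq_b) / longest


prog_a = ["mov", "push", "mov", "call", "mov", "push", "add"]
prog_b = ["push", "mov", "push", "call", "push", "mov", "sub"]
print(frequency_ranked(prog_a))      # ['mov', 'push', 'add', 'call']
print(similarity(prog_a, prog_b))    # 0.5
```

Because each distinct instruction is listed once in a frequency-ranked sequence, the two sequences compared here contain no duplicated tokens.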

Applicable products and methods are malicious code detection tools and program plagiarism detection tools.

Future application areas are malicious code detection and program plagiarism detection.

In general, a token can appear more than once in a sequence, so all possible pairs of tokens must be compared as in the conventional method. In the special case where no token is duplicated, however, it is not necessary to compare all possible pairs.

FIG. 1 shows a conventional method of extracting the longest common subsequence.

Longest common subsequence extraction is a well-known dynamic programming problem. Because every pair of tokens must be compared, the running time of the algorithm is known to be proportional to the product of the lengths of the two sequences. As shown in FIG. 1, to extract the longest common subsequence of the two strings "BDCAE" and "ABCEDF", a matrix whose size corresponds to the product of the lengths of the strings must be filled. Each cell of the matrix represents a pair of tokens taken from the two strings. The basic idea of the algorithm is to find the longest common subsequence by filling each cell with the result of comparing the corresponding pair of tokens. Since a comparison has to be performed for every pair of tokens, the algorithm takes time and memory space proportional to the product of the two sequence lengths.
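As a point of reference, the conventional table-filling approach can be sketched as below for the two example strings. The function names are illustrative, and the code is the textbook dynamic-programming formulation rather than a transcription of FIG. 1.

```python
def lcs_table(x, y):
    """Fill the (len(x)+1) x (len(y)+1) table used by the classic algorithm."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, start=1):
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                # Matching pair: extend the best result of both prefixes.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


def lcs_classic(x, y):
    """Trace back through the filled table to recover one LCS."""
    table = lcs_table(x, y)
    i, j = len(x), len(y)
    tokens = []
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            tokens.append(x[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(tokens))


print(lcs_classic("BDCAE", "ABCEDF"))  # BCE
```

Every one of the 5 x 6 token pairs is visited once, which is where the cost proportional to the product of the two lengths comes from.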

A comparison is performed for every pair of tokens because the algorithm does not know where a particular token appears in the string. And because a token can appear more than once, the comparison has to continue to the end of the target string. If no token is duplicated, however, a particular token only needs to be found once in the string, and the rest of the string after it does not need to be compared, because the token cannot appear again later in the string. In other words, it is enough to find a specific token once in the string and update the value of the corresponding cell. No other cell is updated at that time, and the updated cell is not updated again in the rest of the process. Therefore, the memory required is not a matrix corresponding to the product of the sizes of the two strings, but only an array the size of one string. In addition, when a symbol table is used for the tokens, the step of scanning a string to find a token's position can be shortened to a constant-time lookup. In that case, however, additional memory is used for the table, up to one entry per possible token.
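The symbol table mentioned above could look something like the following sketch, which uses a Python dictionary as the table. The function names are hypothetical, and rejecting duplicated tokens with an exception is an assumption made here to enforce the precondition.

```python
def build_position_map(y):
    """Map each token of y to its single position; reject duplicated tokens."""
    position = {}
    for index, token in enumerate(y):
        if token in position:
            raise ValueError(f"token {token!r} appears more than once")
        position[token] = index
    return position


def positions_in_y(x, y):
    """For each token of x, the position where it appears in y (None if absent)."""
    position = build_position_map(y)
    # Each lookup is a dictionary access instead of a scan over y.
    return [position.get(token) for token in x]


print(positions_in_y("BDCAE", "ABCEDF"))  # [1, 4, 2, 0, 3]
```

The table holds one entry per token of the second string, which is the extra memory traded for the faster lookup.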

FIG. 2 shows the method of extracting the longest common subsequence of the present invention.

The left side of the figure shows the extraction process as pseudocode. For each token (xi) of the string X, the string map is used to find the position in the array at which that token appears in the string Y, the largest value to the left of that position in the array is found, and the resulting value for the token (xi) is stored in the array at the position where it appears in the string Y. When this has been done for all tokens (xi) of X, the array holds the data shown as the last array on the right side of the figure. The longest common subsequence can then be extracted by searching for the largest value starting from the right end of the array.
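A minimal sketch of the two steps described for FIG. 2 follows. It is written from the prose only, so several points are assumptions: the stored value is taken to be the largest value on the left plus one, the extraction walks from the right looking first for the largest value and then for each next smaller value, and the function names are hypothetical.

```python
def update_array(x, y):
    """Array update step: one cell per token of y, each written at most once."""
    position = {token: j for j, token in enumerate(y)}
    if len(position) != len(y) or len(set(x)) != len(x):
        raise ValueError("sequences must not contain duplicated tokens")
    lengths = [0] * len(y)
    for token in x:
        j = position.get(token)
        if j is None:
            continue  # token of x does not appear in y
        # Assumed update rule: largest value to the left of j, plus one.
        lengths[j] = max(lengths[:j], default=0) + 1
    return lengths


def extract_lcs(y, lengths):
    """Extraction step: scan from the right for the largest value, then for
    each next smaller value, collecting the tokens of y found on the way."""
    wanted = max(lengths, default=0)
    tokens = []
    for j in range(len(y) - 1, -1, -1):
        if wanted > 0 and lengths[j] == wanted:
            tokens.append(y[j])
            wanted -= 1
    tokens.reverse()
    return tokens


x, y = "BDCAE", "ABCEDF"
array = update_array(x, y)
print(array)                        # [1, 1, 2, 3, 2, 0]
print("".join(extract_lcs(y, array)))  # BCE
```

Only a single array the length of Y is kept. In this sketch the largest value on the left is found with a plain scan of the array; the document does not specify how that search is carried out.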

FIG. 3 shows a flowchart of the method of extracting the longest common subsequence of the present invention. The method consists of an initialization step that builds the map used to quickly determine whether, and at which position, the same token is present, followed by an array update step for extracting the longest common subsequence. Once the array update is complete, the longest common subsequence of the sequences without duplicated tokens can be extracted.

The method according to an embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in the art of computer software. Examples of computer-readable media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specifically configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine language code such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.

While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. For example, appropriate results may be achieved even if the described techniques are performed in a different order than the described methods, and/or components of the described systems, structures, devices, circuits, and the like are combined in a different form, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents to the claims are also within the scope of the following claims.

Claims (2)

1. A method of updating an array to find the longest common subsequence of two sequences.
2. A method of finding the longest common subsequence from the updated array.
KR1020130061174A 2013-05-29 2013-05-29 Method for extracting longest common sub-sequence in sequence without appearance of duplicated token KR20140140677A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020130061174A KR20140140677A (en) 2013-05-29 2013-05-29 Method for extracting longest common sub-sequence in sequence without appearance of duplicated token

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020130061174A KR20140140677A (en) 2013-05-29 2013-05-29 Method for extracting longest common sub-sequence in sequence without appearance of duplicated token

Publications (1)

Publication Number Publication Date
KR20140140677A true KR20140140677A (en) 2014-12-10

Family

ID=52458381

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020130061174A KR20140140677A (en) 2013-05-29 2013-05-29 Method for extracting longest common sub-sequence in sequence without appearance of duplicated token

Country Status (1)

Country Link
KR (1) KR20140140677A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109620241A (en) * 2018-11-16 2019-04-16 青岛真时科技有限公司 A kind of wearable device and the movement monitoring method based on it

Similar Documents

Publication Publication Date Title
JP6894058B2 (en) Hazardous address identification methods, computer-readable storage media, and electronic devices
US8996356B1 (en) Techniques for predictive input method editors
US8977626B2 (en) Indexing and searching a data collection
WO2015184992A1 (en) Method for recognizing duplicate image, and image search and deduplication method and device thereof
CN104615589A (en) Named-entity recognition model training method and named-entity recognition method and device
WO2016015621A1 (en) Human face picture name recognition method and system
US9996271B2 (en) Storage controller and method of operating the same
KR101643729B1 (en) System and method of data managing for time base data backup, restoring, and mounting
US20130262090A1 (en) System and method for reducing semantic ambiguity
KR102260631B1 (en) Duplication Image File Searching Method and Apparatus
US10224958B2 (en) Computer-readable recording medium, encoding apparatus, and encoding method
KR101541603B1 (en) Method and apparatus for determing plagiarism of program using control flow graph
US9984065B2 (en) Optimizing generation of a regular expression
US20140309984A1 (en) Generating a regular expression for entity extraction
US8990741B2 (en) Circuit design support device, circuit design support method and program
US10884873B2 (en) Method and apparatus for recovery of file system using metadata and data cluster
KR20140140677A (en) Method for extracting longest common sub-sequence in sequence without appearance of duplicated token
JP7245817B2 (en) Continuous value matching in data processing equipment
CN104049949A (en) Peephole optimization method based on BSWAP instruction
CN116301775A (en) Code generation method, device, equipment and medium based on reset tree prototype graph
US20150095897A1 (en) Method and apparatus for converting programs
KR101559651B1 (en) Method and apparatus of dynamic analysis
CN108664900B (en) Method and equipment for identifying similarities and differences of written works
CN111788552A (en) System and method for low latency hardware memory
KR20210100076A (en) Generate vector predicate summary

Legal Events

Date Code Title Description
WITN Withdrawal due to no request for examination