WO2005096208A1 - Base sequence search apparatus and base sequence search method - Google Patents
- Publication number
- WO2005096208A1 (application PCT/JP2005/006397)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- base sequence
- input
- unit
- sequence
- base
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/50—Mutagenesis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
Definitions
- Base sequence search apparatus and base sequence search method
- The present invention relates to an apparatus and a method for searching gene base sequences, which represent genetic information.
- DNA has a structure in which nucleotides containing the bases adenine (A), cytosine (C), guanine (G), and thymine (T) are arranged in a row.
- A pairs with T and G pairs with C, and these pairs form the double helix structure.
- The base sequence of the DNA that expresses a gene (hereinafter the "gene base sequence") is transcribed into RNA (ribonucleic acid) and spliced to produce mRNA (messenger RNA), from which the protein is synthesized.
- RNA is a nucleic acid having D-ribose as its sugar component and the bases adenine (A), cytosine (C), guanine (G), and uracil (U).
- RNA interference is a phenomenon in which the presence of a specific double-stranded RNA in a cell destroys the mRNA of a specific sequence and suppresses gene expression. This phenomenon was first discovered in experiments with nematode cells. Later, this phenomenon became known to occur in mammalian cells and attracted attention. By artificially causing RNA interference, the function of a specific gene can be suppressed, and the function of that specific gene can be examined. Also, by using RNA interference, there is a possibility that a drug that exerts the effect of suppressing the action of a specific gene can be developed.
- FIG. 1 is a diagram schematically showing the process of RNA interference.
- RNA interference is thought to occur through the following processes.
- siRNA (short interfering RNA) is produced from a long double-stranded RNA.
- An RNA-induced silencing complex (RISC) 102 is formed.
- RISC (102) is compatible with the siRNA.
- The mRNA (103) becomes nonfunctional.
- "There is homology between one base sequence (S) and another base sequence (T)" means that the two base sequences (S, T) have complementarity or imperfect complementarity.
- "Complementarity" means that pairs of A and T, G and C, and A and U are formed completely over the whole of the two base sequences. Imperfect complementarity accordingly means that pairs other than A and T, G and C, and A and U occur in part of the two base sequences.
- Homology is often judged to be present at 80% or more, preferably 90% or more, and more preferably 95% or more. The presence or absence of homology may also be determined by considering not only the ratio of complementary base pairs but also the number of consecutive complementary bases in the base sequences. It is also known that, in addition to the three complementary base pairs A and T, G and C, and A and U, a pair of G and U may be formed, and the presence or absence of homology may be determined with G and U base pairs taken into account.
- An siRNA sequence should appear only in the gene of interest and have no homology to the base sequence of any other gene. Therefore, when designing an siRNA sequence, it is necessary to confirm that no gene other than the target gene has a base sequence similar to the siRNA sequence.
- A "microarray" is a type of DNA chip in which oligo DNA with a length of about 15 to 60 bases is synthesized on a substrate such as glass (for example, see Non-Patent Document 1).
- FIG. 2 exemplifies processes such as gene analysis and genetic diagnosis using a microarray.
- When DNA 202 carrying a label 203 such as a fluorescent dye is flowed over a microarray 201 having oligo DNA synthesized on a substrate such as glass, the DNA binds (hybridizes) to the oligo DNA on the microarray that is complementary or homologous to it (reference numeral 204).
- The type of the DNA 202 and the like is determined by detecting the fluorescence of the label's fluorescent dye and thereby finding which oligo DNA the DNA hybridized to.
- FIG. 2 shows only a few oligo DNAs on the microarray, but in an actual microarray on the order of 10,000 oligo DNAs are arranged in a region about 0.5 inch in length and width.
- BLAST (for example, see Non-Patent Document 2)
- Smith-Waterman (for example, see Non-Patent Document 3)
- Non-Patent Document 1: Naoki Sugimoto, "Gene Chemistry", 19 pages, published by Kagaku Doujin Inc., 2002
- Non-Patent Document 2: S.F.Altschul, W. Gish, W. Miller, E.W.Myers, and D.J.
- Non-Patent Document 3: T.F.Smith and M.S.Waterman, "Identification of common molecular subsequences", J. Mol. Biol., 147, 195-197, 1981
- Disclosure of the invention
- The method using BLAST has the problem that it can overlook the existence of similar base sequences.
- In BLAST, a search is usually performed using a portion in which seven consecutive bases are identical. For this reason, when a base sequence of 19 bases is given, for example, a base sequence that has mismatches at the positions marked X in FIG. 3 cannot be found, and a similar base sequence is overlooked.
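The effect described above can be checked with a small sketch. The 19-base query and the two mismatch positions below are made up for illustration (FIG. 3 is not reproduced in this text, so its X positions are an assumption); the point is only that two well-placed mismatches leave no run of seven identical consecutive bases for a seed to find.

```python
def longest_exact_run(a: str, b: str) -> int:
    """Length of the longest stretch of consecutive identical positions."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


SEED = 7  # seed length quoted in the text for BLAST

query = "ACGTACGTACGTACGTACG"  # 19 bases, hypothetical example
# Mismatches at 0-based positions 6 and 12 (assumed positions, not FIG. 3's)
target = query[:6] + "T" + query[7:12] + "C" + query[13:]

mismatches = sum(x != y for x, y in zip(query, target))
seed_can_hit = longest_exact_run(query, target) >= SEED
print(mismatches, longest_exact_run(query, target), seed_can_hit)
```

Here the two sequences differ in only 2 of 19 positions, yet the longest exact run is shorter than the seed, so a search that relies on a 7-base exact match would not report the target.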
- An object of the present invention is to provide an apparatus and a method capable of detecting the presence of similar base sequences with a small amount of calculation.
- The present invention specifies, from the input base sequence, two partial sequences of a predetermined length and the remaining portion. The Hamming distance, which is the number of bases to be replaced with bases that do not match the corresponding bases, is divided and assigned to those partial sequences and the remaining portion. Of the two partial sequences, the search is performed with the one for which the total number of base sequences obtained by replacing the assigned number of bases with incompatible bases is smaller.
- As a result, the amount of calculation required for the search can be reduced, and there is no possibility of overlooking a similar base sequence whose Hamming distance is equal to, or not more than, a predetermined value.
- The following describes a base sequence search device that searches for similar base sequences using an index for finding occurrences of a base sequence of a predetermined length in a database storing gene base sequences representing genetic information. The device specifies two partial sequences of a predetermined length and the remaining portion from the input base sequence, divides the Hamming distance among those partial sequences and the remaining portion, and searches with whichever of the two partial sequences yields the smaller total number of base sequences when the assigned number of bases is replaced with incompatible bases.
- "Corresponding bases are compatible" means that a given binary relation holds between the two bases forming a pair.
- This binary relation often means that the bases forming a pair are the same.
- This corresponds to the case where the binary relation satisfies only the reflexive law. It is also possible to use a binary relation that takes into account the fact that the bases G and U bond easily.
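A minimal sketch of such a binary relation, assuming it is represented as a set of ordered base pairs. The names `IDENTITY`, `WITH_GU`, and `hamming` are illustrative and do not come from the patent.

```python
BASES = "ACGU"

# Relation satisfying only the reflexive law: a base is compatible with itself.
IDENTITY = {(b, b) for b in BASES}

# Assumed variant that additionally treats G and U as compatible.
WITH_GU = IDENTITY | {("G", "U"), ("U", "G")}


def hamming(s: str, t: str, compatible=IDENTITY) -> int:
    """Count positions whose corresponding bases are not compatible."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum((a, b) not in compatible for a, b in zip(s, t))


print(hamming("GGAU", "UGAU"))           # identity relation
print(hamming("GGAU", "UGAU", WITH_GU))  # G-U pair tolerated
```

Keeping the relation as data rather than hard-coding equality makes it possible to swap in another notion of compatibility without touching the distance function.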
- The "predetermined length" is the length of base sequence that can be accepted by the index used to search the database storing the gene base sequences. For example, in the case of BLAST, the predetermined length is usually 7.
- A "similar base sequence" is a base sequence that has the same length as the input base sequence, is similar to it, and appears in a gene base sequence. "Similar" means, for example, that the Hamming distance from the input base sequence takes a given value, as described later.
- The "gene base sequence" is a base sequence stored in the database. Depending on the structure of the index, there may be a plurality of predetermined lengths.
- Such a base sequence search device can be implemented, for example, as a server device that receives a base sequence and a degree of similarity (for example, a Hamming distance) entered in a web browser, issues queries or the like to the database storing the gene base sequences, performs the processing, and returns the result to the web browser.
- Accordingly, each unit and each means constituting the base sequence search device according to the present invention can be configured by hardware, by software, or by both hardware and software (a program). For example, when a computer is used, they can be realized by hardware consisting of a CPU, memory, bus, interfaces, peripheral devices, and the like, together with software executable on that hardware.
- FIG. 4 shows an example of a functional block diagram of the base sequence search device according to the first embodiment of the present invention.
- The base sequence search device 400 includes a base sequence input unit 401, a Hamming distance input unit 402, a specifying unit 403, an assignment unit 404, a selection unit 405, a replacement base sequence generation unit 406, and a search unit 407.
- The "base sequence input unit" 401 inputs a base sequence whose length exceeds the predetermined length. For example, it receives information indicating the base sequence entered in a web browser.
- The "Hamming distance input unit" 402 inputs a Hamming distance for the input base sequence.
- For example, it receives the numerical value entered in the web browser.
- The "input base sequence" is the base sequence input to the base sequence input unit 401.
- The Hamming distance is a value indicating the number of bases to be replaced with incompatible bases. The Hamming distance is defined for two base sequences of the same length as the number of positions at which the corresponding bases are incompatible. Specifying a Hamming distance for a single base sequence therefore also defines the set of base sequences obtained by replacing that number of its bases with incompatible bases.
- The "specifying unit" 403 specifies two different partial sequences of the input base sequence, each of a predetermined length, and the remaining portion.
- The two partial sequences may have a common part. In some cases there may be no remaining portion.
- FIG. 7 exemplifies the two partial sequences and the remaining portion specified by the specifying unit 403.
- In FIG. 7 (1), the first partial sequence 711 and the second partial sequence 712 lie in the input base sequence 710 without a common part, and the two ends and the center of the input base sequence form the remaining portions 713, 714, 715.
- In FIG. 7 (2), the first partial sequence 721 and the second partial sequence 722 have a common part at approximately the center of the input base sequence 720, and the remaining portions 723, 724 lie at the ends of the input base sequence 720.
- In FIG. 7 (3), the first partial sequence 731 extends from the left end of the input base sequence 730, the second partial sequence 732 extends from the right end, and together the first partial sequence 731 and the second partial sequence 732 make up the input base sequence 730.
- When the length of the input base sequence exceeds twice the predetermined length, the approximate center of the input base sequence 740 becomes the remaining portion 743, as illustrated in FIG. 7 (4).
- Although the lengths of the first and second partial sequences are predetermined, there may be a plurality of predetermined lengths depending on the structure of the index, as described above.
- The length of the first partial sequence and the length of the second partial sequence may be the same or different.
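One way to picture the specification step is the sketch below, which covers only the layouts in which the first partial sequence starts at the left end and the second ends at the right end (the style of FIG. 7 (3) and (4)). The function name `specify` and the use of half-open index ranges are assumptions made for illustration, and both partial sequences are given the same length.

```python
def specify(n: int, length: int):
    """Return (first, second, remaining) as half-open (start, end) ranges.

    n      -- number of bases in the input base sequence
    length -- the predetermined length of each partial sequence
    """
    if n <= length:
        raise ValueError("input must be longer than the predetermined length")
    first = (0, length)        # anchored at the left end
    second = (n - length, n)   # anchored at the right end
    # A middle remaining portion exists only when the two do not meet.
    remaining = [(length, n - length)] if n > 2 * length else []
    return first, second, remaining


print(specify(19, 7))  # middle remaining portion of 5 bases
print(specify(14, 7))  # the two partial sequences tile the input exactly
print(specify(10, 7))  # the two partial sequences share a common part
```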
- The "assignment unit" 404 divides the Hamming distance input by the Hamming distance input unit 402 and assigns it to the partial sequences specified by the specifying unit 403 and to the remaining portion.
- "Dividing and assigning the Hamming distance" means dividing the Hamming distance into non-negative integers and assigning the integers obtained by the division to the partial sequences and the remaining portion. Therefore, the sum of the assigned values is the Hamming distance.
- Such processing can easily be realized by a program. For example, a program that nests as many loops as there are partial sequences and remaining portions can obtain all assignments.
- FIG. 17 shows an example of a program, written in C, that divides and assigns the Hamming distance.
- In this program, each partial sequence is identified by a number. For example, if there are P partial sequences, they are identified by P, P-1, P-2, ..., 1, and the P-th, (P-1)-th, ..., first elements of the array vec are assumed to correspond to them.
- Each time distributeHammingDistance is called, the Hamming distance assigned to one partial sequence is stored in one of vec[P], vec[P-1], vec[P-2], ..., vec[1], and a recursive call to distributeHammingDistance is made. For example, if a given call to distributeHammingDistance stores in vec[q] the Hamming distance assigned to partial sequence q, then, unless q is 1, a recursive call is made with the first argument of distributeHammingDistance set to q-1.
- int represents an integer data type.
- int h means that the variable h takes a value of the integer data type.
- for (S1; S2; S3) { S4 } means that S1 is executed first and that, as long as the condition S2 is satisfied, execution of S4 followed by execution of S3 is repeated.
- distributeHammingDistance is illustrated in FIG. 17 as an example.
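FIG. 17 itself is not reproduced in this text, so the following is only a sketch of the same idea in Python rather than the patent's C program: a recursive generator that enumerates every way of splitting a Hamming distance into a fixed number of non-negative integers. The name `distribute` is hypothetical.

```python
def distribute(h: int, parts: int):
    """Yield every tuple of `parts` non-negative integers that sums to h."""
    if parts == 1:
        yield (h,)  # the last part takes whatever distance is left
        return
    for first in range(h + 1):
        # Fix the value of one part, then recurse on the remaining parts.
        for rest in distribute(h - first, parts - 1):
            yield (first,) + rest


for allocation in distribute(3, 3):
    print(allocation)
```

For a Hamming distance of 3 split over three parts this yields 10 tuples, each summing to 3.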
- FIG. 8 is a diagram explaining the assignment of the Hamming distance by the assignment unit 404 for the cases (1) to (4) of FIG. 7.
- In case (1), the assignment unit 404 assigns the values $m_1$, $m_2$, $m_3$, $m_4$, $m_5$ to the parts in order from the left end of the input base sequence (that is, the remaining portion, the first partial sequence, the remaining portion, the second partial sequence, and the remaining portion). Values are assigned to each part in the same way in the other cases.
- The "input Hamming distance" is the Hamming distance input to the Hamming distance input unit 402.
- The "selection unit" 405 selects, of the two partial sequences specified by the specifying unit 403, the one for which the total number of replacement base sequences is not larger. A replacement base sequence is a base sequence generated by replacing, in a partial sequence, the number of bases indicated by the Hamming distance assigned by the assignment unit with incompatible bases. This total number is given by a formula in terms of the (number of incompatible bases).
- In many cases this means selecting the partial sequence to which the assignment unit 404 assigned the Hamming distance that is not larger. In the case of FIG. 8 (1), for example, the values assigned to the first and second partial sequences are compared.
- In some cases, however, the number may be smaller. The ordering of the total numbers of replacement base sequences therefore may not match the ordering of the Hamming distances, so care must be taken. To simplify the description, the following assumes that the total number of replacement base sequences is not larger when the Hamming distance assigned by the assignment unit 404 is not larger.
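The formula for the total number is not legible in this text, so the sketch below uses a standard count under a stated assumption: every base has the same number of incompatible bases (3 when compatibility means identity over A, C, G, T). The names `replacement_count` and `select` are hypothetical.

```python
from math import comb


def replacement_count(length: int, m: int, incompatible: int = 3) -> int:
    """Sequences at Hamming distance exactly m from one sequence of `length`.

    Choose the m positions to change, then one of `incompatible` bases each.
    """
    return comb(length, m) * incompatible ** m


def select(len1: int, m1: int, len2: int, m2: int) -> int:
    """Return 1 or 2: the partial sequence whose count is not larger."""
    c1 = replacement_count(len1, m1)
    c2 = replacement_count(len2, m2)
    return 1 if c1 <= c2 else 2


print(replacement_count(7, 1), replacement_count(7, 2))
print(select(7, 1, 7, 2))
print(select(3, 3, 7, 2))
```

With equal lengths the smaller assigned distance always wins. The last call shows one way the ordering of counts can differ from the ordering of distances under these assumptions: a short partial sequence with distance 3 has fewer replacements than a longer one with distance 2.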
- m is compared with m. For example, if m is not larger,
- FIG. 9 shows the assignment by the assignment unit 404 and the selection by the selection unit 405 when the input Hamming distance is 3 and the partial sequences and the remaining portion are specified as shown in FIG. 7 (4).
- The sum of $m_1$, $m_2$, and $m_3$ is equal to the input Hamming distance 3,
- the number of choices that make 3 1 3 1 is greater than the number of choices that make m ⁇ m.
- If the replacement base sequence generation unit 406 generates the replacement base sequences and the search unit searches by referring to the index, all 10 cases in which the sum of $m_1$, $m_2$, and $m_3$ is 3 are covered.
- In this way, the Hamming distance input to the Hamming distance input unit is assigned to a plurality of parts, and the partial sequence with the value that is not larger is selected according to whether $m_1 \le m_3$ or $m_1 > m_3$.
- The sum of $m_1$, $m_2$, and $m_3$ gives the input Hamming distance.
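The FIG. 9 situation can be enumerated directly. The sketch assumes the layout of FIG. 7 (4), with $m_1$ assigned to the first partial sequence, $m_2$ to the middle remaining portion, and $m_3$ to the second partial sequence; the variable names are illustrative.

```python
from itertools import product

H = 3  # input Hamming distance, as in the FIG. 9 example

# Every (m1, m2, m3) of non-negative integers with m1 + m2 + m3 == H.
allocations = [a for a in product(range(H + 1), repeat=3) if sum(a) == H]

# For each allocation, pick the partial sequence whose value is not larger.
choices = [("first", m1) if m1 <= m3 else ("second", m3)
           for m1, m2, m3 in allocations]

for alloc, choice in zip(allocations, choices):
    print(alloc, "->", choice)

worst = max(distance for _, distance in choices)
print(len(allocations), worst)
```

Because $m_1 + m_3 \le 3$, the smaller of the two is never more than 1, so the selected partial sequence only ever needs replacements at Hamming distance 0 or 1.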
- FIG. 10 shows the assignment by the assignment unit 404 and the resulting selection when the input Hamming distance is 3 and the partial sequences and the remaining portion are specified as shown in FIG. 7.
- the sum of m 1, m 2 and m is equal to the input Hamming distance 3
- The Hamming distance input to the Hamming distance input unit is assigned to a plurality of parts, the sums of the assigned values are compared (m + m > m + m or m + m ≤ m + m), and the side whose sum is not larger is selected.
- The replacement base sequence generation unit 406 generates, for the partial sequence selected by the selection unit 405, the replacement base sequences at the Hamming distance assigned by the assignment unit 404. That is, it generates the base sequences in which, among the bases of the selected partial sequence, the number of bases indicated by the assigned Hamming distance is replaced with incompatible bases. In the case of FIG. 9, for example, partial sequences at a Hamming distance of 0 or 1 from the first partial sequence are generated as replacement base sequences, and likewise partial sequences at a Hamming distance of 0 or 1 from the second partial sequence. When the Hamming distance is 1, the base sequences in which any one base of the partial sequence is replaced with an incompatible base are generated.
- For the first partial sequence, replacement base sequences having a Hamming distance of 0, 1, 2, or 3 are generated.
- Partial sequences having a Hamming distance of 0 or 1 are generated as replacement base sequences.
- When the input Hamming distance is 3, the need to generate replacement base sequences with a Hamming distance of 3 may seem inefficient.
- 3 was assigned to m, so the first part
- A program for generating the replacement base sequences can be created easily. For example, in a program with nested loops, an outer loop specifies the position in the partial sequence at which a base is replaced with an incompatible base, and an inner loop replaces the base at that position with each incompatible base. If the predetermined length is L and bases are defined as incompatible when they differ, then in the case of FIG. 9, $1 + 3\,{}_{L}C_{1}$ replacement base sequences are generated.
- When Hamming distances up to 3 are needed, $1 + 3\,{}_{L}C_{1} + 3^{2}\,{}_{L}C_{2} + 3^{3}\,{}_{L}C_{3}$ replacement base sequences are generated. Because L is generally smaller than the length of the input base sequence, however, the amount of computation required for this generation is less than that required to obtain every such base sequence for the whole input base sequence.
- FIG. 18 shows an example of a program that generates the replacement base sequences of a partial sequence of length L, held in an array S, when a Hamming distance of 2 is assigned to it.
- The array subscripts start from 0, and each of S[0], S[1], ..., S[L-1] stores one of the symbols A, C, G, or T indicating a base.
- foreach a1 in {A, C, G, T} { S } indicates that S is executed while the value of the variable a1 is changed to A, C, G, and T in turn.
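FIG. 18 is not reproduced here, so the following is a sketch of the same generation step generalized to any assigned Hamming distance, not the patent's program. It assumes compatibility means identity over A, C, G, T; the name `replacements` is hypothetical.

```python
from itertools import combinations, product


def replacements(seq: str, d: int, alphabet: str = "ACGT"):
    """Yield every sequence at Hamming distance exactly d from seq."""
    # Outer choice: which d positions are changed.
    for positions in combinations(range(len(seq)), d):
        # Inner choice: which differing base goes at each chosen position.
        options = [[b for b in alphabet if b != seq[p]] for p in positions]
        for chosen in product(*options):
            out = list(seq)
            for p, base in zip(positions, chosen):
                out[p] = base
            yield "".join(out)


part = "ACGTACG"  # a partial sequence of length L = 7 (made-up example)
print(len(list(replacements(part, 0))))
print(len(list(replacements(part, 1))))
print(len(list(replacements(part, 2))))
```

For d = 0 the generator yields the partial sequence itself, and the counts for d = 1 and d = 2 follow the binomial expression in the text with L = 7.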
- Search unit 407 performs a search using the above-mentioned index using the replacement base sequence generated by the replacement base sequence generation unit as a key. Indexes are often implemented using hashing techniques.
- The "index" is an index for searching for occurrences of a base sequence of a predetermined length in a database storing gene base sequences. A search using the index generally yields information on the position where the replacement base sequence appears (for example, information indicating the position of the base at the end of the replacement base sequence, counted from the end of the DNA).
- The search unit 407 makes an inquiry to the database. If another server holds such a database, the search unit 407 may send the inquiry to that server and receive the result.
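As an illustration of a hash-based index of this kind, the sketch below maps every substring of the predetermined length to the positions where it starts. It is an in-memory toy under stated assumptions (a single string stands in for the database, positions are 0-based from the left end); `build_index` is a hypothetical name.

```python
from collections import defaultdict


def build_index(genome: str, k: int):
    """Map each length-k substring to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


genome = "ACGTACGT"  # stand-in for a stored gene base sequence
index = build_index(genome, 3)

print(index.get("ACG", []))  # positions where the key occurs
print(index.get("TTT", []))  # a key that never occurs
```

Using `get` with a default avoids inserting empty entries into the index when a replacement base sequence does not occur at all.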
- FIG. 11 illustrates a flowchart of a process of the base sequence search apparatus of FIG. 4 according to the present embodiment.
- In step S1101, a base sequence is input by the base sequence input unit 401 or the like (base sequence input step).
- In step S1102, the Hamming distance is input by the Hamming distance input unit 402 or the like (Hamming distance input step).
- In step S1103, the specifying unit 403 or the like specifies the two partial sequences and the remaining portion (specifying step).
- In step S1104, the input Hamming distance is divided and assigned by the assignment unit 404 or the like (assignment step).
- In step S1105, the selection unit 405 or the like selects, so that no duplication occurs, the partial sequence with the smaller total number of replacement base sequences at the Hamming distance assigned in the assignment step (selection step).
- In step S1106, the replacement base sequences are generated by the replacement base sequence generation unit 406 or the like (replacement base sequence generation step).
- In step S1107, a search is performed by the search unit 407 or the like (search step).
- The base sequence search device can thus be regarded as a device that uses a base sequence search method comprising a base sequence input step, a Hamming distance input step, a specifying step, an assignment step, a selection step, a replacement base sequence generation step, and a search step.
- After step S1101, the other steps may be executed repeatedly while the Hamming distance input in step S1102 is changed to 0, 1, 2, 3, 4, and so on.
- Alternatively, steps S1101 and S1103 may be performed first, followed by step S1102 and then the other steps.
- Steps S1101 to S1104 may also be repeated first, after which step S1105 and the subsequent steps are executed collectively. In this way, the calculation can be performed efficiently without repeating a search that uses the same partial sequence.
- According to this embodiment, the amount of calculation required for the search can be reduced, and similar base sequences whose Hamming distance is a predetermined value, is not more than a predetermined value, or is any of an arbitrary combination of values can be searched for without omission.
- The configuration of the base sequence search device represented by the functional block diagram of FIG. 4 can be realized in hardware by the CPU, memory, and other LSIs of any computer. In software, it can be realized by a program or the like loaded into memory. It can also be realized by cooperation between hardware and software. In particular, when software is used, the programs constituting that software are recorded on various media and, as necessary, are mechanically read by the computer used to implement the base sequence search device.
- “medium” refers to any “portable physical medium” such as a flexible disk, magneto-optical disk, ROM, EPROM, EEPROM, CD-ROM, MO, DVD, flash disk, and various computer systems.
- the computer is not limited to a mainframe computer, but may be an information processing device such as a workstation or a personal computer. Further, a peripheral device such as a printer or a scanner may be further connected to such an information processing device.
- A "program" is a data processing method described in any language or notation, and may take any form such as source code or binary code. A "program" is not necessarily limited to a single unit; it includes programs distributed across multiple modules or libraries and programs that achieve their function in cooperation with a separate program such as an operating system. Known configurations and procedures can be used for the specific configuration for reading the medium in the base sequence search device, the reading procedure, the installation procedure after reading, and the like.
- the base sequence input unit 401, the Hamming distance input unit 402, the specifying unit 403, the assignment unit 404, the selection unit 405, and the replacement base sequence generation unit 406 can be realized as a module constituting a program. Such modules are naturally controlled by the computer CPU.
- The base sequence search device may be configured to be communicably connected, via a communication network such as the Internet, to an external system that provides an external database of gene base sequence information and the like, external programs for searching it, and so on.
- Such a configuration may provide a website for running the external programs.
- The external system may be configured as a web server, an ASP server, or the like.
- The base sequence search device may be communicably connected to the external system.
- The configuration of the communication network is not particularly limited. For example, it may consist of communication devices such as routers and wired or wireless communication lines such as dedicated lines.
- FIG. 12 shows an example of a functional block diagram of the base sequence search device according to the second embodiment of the present invention.
- the base sequence search device 1200 includes a base sequence input unit 401, a Hamming distance input unit 402, a specifying unit 403, an assignment unit 404, a selection unit 405, a replacement base sequence generation unit 406, and a search unit 407.
- The specifying unit 403 includes first specifying means 1201. That is, the base sequence search device according to the present embodiment is the base sequence search device according to Embodiment 1 whose specifying unit has the first specifying means.
- When the number of bases of the base sequence input by the base sequence input unit is not more than twice the predetermined length, the first specifying means 1201 makes the end of one of the two partial sequences coincide with one end of the input base sequence and the end of the other partial sequence coincide with the other end, so that no remaining portion arises or is specified. Since no remaining portion arises or is specified, the assignment unit does not assign any Hamming distance to a remaining portion.
- The first specifying means specifies the first partial sequence and the second partial sequence as shown in FIG. 7 (3). Since this case has already been described in Embodiment 1, further description is omitted.
- FIG. 13 shows an example of a functional block diagram of a base sequence search device according to Embodiment 3 of the present invention.
- the base sequence search device 1300 includes a base sequence input unit 401, a Hamming distance input unit 402, a specifying unit 403, an assignment unit 404, a selection unit 405, a replacement base sequence generation unit 406, and a search unit 407.
- the specifying unit 403 includes a second specifying unit 1301.
- the specifying unit 403 may include the first specifying unit described in the second embodiment. Therefore, the base sequence search device according to the present embodiment has a configuration in which the specifying unit of the base sequence search device according to Embodiment 1 or 2 includes the second specifying unit.
- When the number of bases of the base sequence input by the base sequence input unit is greater than twice the predetermined length, the "second specifying means" 1301 specifies the two partial sequences so that they do not overlap. In this case, there may be one or two remaining portions.
- For example, the two partial sequences are specified so that they lie at the left and right ends of the input base sequence, or so that they are adjacent to each other.
- The second specifying means specifies the first partial sequence and the second partial sequence as shown in FIG. 7 (4). Since this case has already been described in Embodiment 1, further description is omitted.
- The following describes a base sequence search device that obtains candidates for similar base sequences based on the search result of the search unit and determines their Hamming distance from the input base sequence.
- FIG. 14 shows an example of a functional block diagram of a base sequence search device according to Embodiment 4 of the present invention.
- The base sequence search device 1400 includes a base sequence input unit 401, a Hamming distance input unit 402, a specifying unit 403, an assignment unit 404, a selection unit 405, a replacement base sequence generation unit 406, and a search unit 407, and further includes a similar candidate base sequence acquisition unit 1401 and a determination unit 1402. The specifying unit 403 may include one or both of the first specifying means described in Embodiment 2 and the second specifying means described in Embodiment 3. That is, the base sequence search device according to the present embodiment is any one of the base sequence search devices according to Embodiments 1 to 3 to which the similar candidate base sequence acquisition unit 1401 and the determination unit 1402 are added.
- Similar candidate nucleotide sequence acquisition unit 1401 acquires a similar candidate nucleotide sequence based on the search result obtained by the search unit 407.
- the "similar candidate nucleotide sequence" is a nucleotide sequence that appears in the gene nucleotide sequence and includes a replacement nucleotide sequence. More specifically, for example, if a search is performed using a replacement base sequence of the first partial sequence and the position of the base at its end is found, a base sequence of the same length as the input base sequence is obtained from the gene base sequence, taking into account the position that the first partial sequence occupies in the input base sequence.
- if the position obtained by the search is the position of the leftmost base of the first partial sequence, a gene base sequence of the same length as the input base sequence is obtained starting from the position shifted to the left by the length of the remaining portion on the left side of the first partial sequence (0 if there is no such remaining portion).
- likewise, for the second partial sequence, a gene base sequence of the same length as the input base sequence is obtained extending to the left from the position shifted to the right by the length of the remaining portion on the right side of the second partial sequence.
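The window arithmetic described for the similar candidate base sequence can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `candidate_window`, its parameters, and the 0-based positions are assumptions made for the example.

```python
def candidate_window(gene, hit_pos, part_offset, query_len):
    """Return the gene window aligned with the input sequence, or None.

    gene        -- the stored gene base sequence (a string)
    hit_pos     -- 0-based position where the replacement sequence was found
    part_offset -- number of input bases to the left of the partial sequence
                   (the left remaining portion; 0 if there is none)
    query_len   -- length of the input base sequence
    All names are hypothetical and chosen for this sketch.
    """
    start = hit_pos - part_offset      # shift left by the left remaining portion
    end = start + query_len            # window has the input sequence's length
    if start < 0 or end > len(gene):
        return None                    # window would run past an end of the gene
    return gene[start:end]
```

For a partial sequence at the right end of the input, `part_offset` would be the input length minus the partial sequence length minus the right remaining portion, which gives the same window as counting leftward from the shifted right end.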
- the "determination unit" 1402 calculates the Hamming distance between the similar candidate base sequence obtained by the similar candidate base sequence acquisition unit 1401 and the input base sequence, and determines whether or not it is less than or equal to the Hamming distance input to the Hamming distance input unit 402. This determination can be made by comparing the input base sequence and the similar candidate base sequence base by base, in order from the base at one end.
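The base-by-base comparison performed by the determination unit can be sketched as below. The early exit once the limit is exceeded is an assumption added for illustration; the text only states that the bases are compared in order. The function name is hypothetical.

```python
def within_hamming(candidate, query, max_distance):
    """Return the Hamming distance if it is <= max_distance, otherwise None."""
    if len(candidate) != len(query):
        raise ValueError("sequences must have the same length")
    mismatches = 0
    for a, b in zip(candidate, query):   # compare in order from one end
        if a != b:
            mismatches += 1
            if mismatches > max_distance:
                return None              # limit exceeded, stop comparing
    return mismatches
```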
- the flowchart of the process of the base sequence search device includes, after step S1107 of the flowchart illustrated in FIG. 11, a step of acquiring a similar candidate base sequence and a step of determining whether or not the Hamming distance between the similar candidate base sequence and the input base sequence is less than or equal to the input Hamming distance.
- a nucleotide sequence similar to the input nucleotide sequence can be obtained. For example, it is possible to obtain information on genes, other than the target gene to be inactivated by the siRNA, that may also be inactivated by it.
- a description will be given of a base sequence search apparatus capable of designating a combination of bases that are incompatible.
- FIG. 15 shows an example of a functional block diagram of a base sequence search device according to Embodiment 5 of the present invention.
- the base sequence search device 1500 includes a base sequence input unit 401, a Hamming distance input unit 402, a specifying unit 403, an assignment unit 404, a selection unit 405, a replacement base sequence generation unit 406, a search unit 407, a similar candidate base sequence acquisition unit 1401, a determination unit 1402, and an incompatible base set input unit 1501. Therefore, the base sequence search device according to the present embodiment has a configuration in which the base sequence search device according to the fourth embodiment has the incompatible base set input unit 1501.
- the "incompatible base set input unit" 1501 specifies a set of incompatible bases. For example, text information indicating the base pairs that should be determined to be incompatible is entered. Alternatively, the incompatible pairs may be specified indirectly by inputting the pairs of bases that should be determined to be compatible (for example, G and U).
- a search is performed by the search unit based on the set of bases input to the incompatible base set input unit 1501, and the Hamming distance is obtained. For example, based on the set of bases input to the incompatible base set input unit 1501, the replacement base sequence generation unit 406 generates replacement base sequences, the search unit 407 searches the database, and the determination unit 1402 determines the Hamming distance.
- combinations of bases that bind weakly, such as G and U, can be taken into account, so a base sequence can be designed more accurately.
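One way to picture the effect of the specified base set on the Hamming distance is the sketch below. It assumes the indirect form of input mentioned above, in which the user lists the pairs to be treated as compatible. The names `mismatch_count` and `GU_COMPATIBLE` are hypothetical.

```python
def mismatch_count(a, b, compatible_pairs=frozenset()):
    """Count positions whose bases differ and are not declared compatible.

    compatible_pairs is a set of frozensets, e.g. {frozenset("GU")}.
    Identical bases are always treated as compatible.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(
        1
        for x, y in zip(a, b)
        if x != y and frozenset((x, y)) not in compatible_pairs
    )

# Assumed example input: G and U are declared compatible.
GU_COMPATIBLE = frozenset({frozenset("GU")})
```

With the default empty set this reduces to the ordinary Hamming distance; passing `GU_COMPATIBLE` stops G/U positions from being counted.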
- in Embodiment 6 of the present invention, a base sequence search device capable of designating the distribution of base matching between an input base sequence and a similar base sequence will be described.
- FIG. 16 shows an example of a functional block diagram of a base sequence search device according to Embodiment 6 of the present invention.
- the base sequence search device 1600 includes a base sequence input unit 401, a Hamming distance input unit 402, a specifying unit 403, an assignment unit 404, a selection unit 405, a replacement base sequence generation unit 406, a search unit 407, a similar candidate base sequence acquisition unit 1401, a determination unit 1402, and a matching distribution input unit 1601, and the determination unit 1402 has distribution determination means 1602.
- the base sequence search device 1600 may have the incompatible base set input unit described in the fifth embodiment. Therefore, the base sequence search device according to the present embodiment has a configuration in which the base sequence search device according to Embodiment 4 or 5 has the matching distribution input unit 1601 and the determination unit 1402 has the distribution determination means 1602.
- the "matching distribution input unit" 1601 inputs distribution information representing the distribution of matches between corresponding bases of the base sequence input to the base sequence input unit 401 and a similar base sequence.
- examples of distribution information include information indicating that base mismatches occur less or more often at the 5' end, and that base mismatches occur at approximately equal intervals.
- the distribution information may be, for example, a program for determining a distribution of matching bases. Alternatively, it may be information for selecting some of the types of distributions of base matches that have been previously determined.
- Distribution determination means 1602 determines whether or not the distribution information input to the matching distribution input unit 1601 is satisfied.
- the determination unit 1402 may display the result of the determination by the distribution determination unit together with the similar base sequence.
- the base sequence search device is characterized in that, in the base sequence search device according to the sixth embodiment, the distribution information input to the matching distribution input unit 1601 is a lower limit on the length over which corresponding bases of the input base sequence and the similar base sequence match consecutively.
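A lower limit on the length of consecutive matches could be checked as in the following sketch. The function names are hypothetical, and the assumption here is that the limit is satisfied when at least one run of matching bases reaches it.

```python
def longest_match_run(a, b):
    """Length of the longest run of positions at which a and b agree."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0   # a mismatch resets the current run
        best = max(best, run)
    return best

def satisfies_min_run(a, b, min_run):
    """True if some run of consecutive matches is at least min_run long."""
    return longest_match_run(a, b) >= min_run
```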
- the length of the base sequence input to the base sequence input unit is 15 to 60, preferably 15 to 25. In this embodiment, the predetermined length is 11 to 14.
- by setting the length of the base sequence input to the base sequence input unit to 15 to 60, preferably 15 to 25, the base sequence search device according to the present embodiment can be made suitable for siRNA design. With the database used by the inventor for the benchmark test, when the length of the input base sequence was 19 or 20, the search was fastest when the predetermined length was 11 to 14.
- when the predetermined length is small, the number of similar candidate nucleotide sequences increases, while when the predetermined length is large, more calculation is required to generate the replacement nucleotide sequences in the replacement nucleotide sequence generation unit,
- and the number of misses increases when queries are made to the hash tables that make up the index. In other words, the number of queries for sequences that do not exist in the original database increases, and the amount of computation increases.
- the balance point between these two effects is considered to be a predetermined length of 11 to 14.
- it was also found that the length of the base sequence input to the base sequence input unit need not be limited to 19 or 20, and that searches for lengths from 15 to 60 could be performed at a practical speed.
- the present invention can be used for determining the sequence of an oligo DNA having a length of about 60.
- a character string similar to the input character string can be searched from the character strings stored in the database.
- "similar" means that a character string is within a predetermined Hamming distance of the input character string.
- the following character string search device is provided: a character string search device that uses an index for searching a database storing character strings in which letters of an alphabet are arranged one-dimensionally, the index being used to find the positions at which a character string of a predetermined length appears in the stored character strings, in order to search for similar character strings that have the same length as the input character string and appear in the character strings stored in the database.
- the character string search device comprises: a character string input unit that inputs a character string whose length exceeds the predetermined length, the character string input to the character string input unit being the input character string;
- a Hamming distance input unit that inputs a Hamming distance indicating the number of letters of the input character string that are replaced with incompatible letters;
- a specifying unit that specifies two different partial character strings of the input character string, each of the predetermined length, and the remaining portion;
- an assignment unit that divides the Hamming distance input by the Hamming distance input unit and assigns it to the partial character strings specified by the specifying unit and to the remaining portion;
- a selection unit that selects, of the two partial character strings specified by the specifying unit, the one with the smaller total number of replacement character strings, a replacement character string being a character string generated by replacing as many letters of the partial character string as indicated by the Hamming distance assigned by the assignment unit with incompatible letters;
- a replacement character string generation unit that generates, for the partial character string selected by the selection unit, replacement character strings having the Hamming distance assigned by the assignment unit; and a search unit that performs a search using the index, with each replacement character string generated by the replacement character string generation unit as a search key.
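The units listed above can be pictured with the following sketch. It is an illustration under stated assumptions, not the claimed implementation: the index is a plain Python dictionary of k-mers, the two partial strings are the non-overlapping left and right ends of the input (so the input must be at least twice the predetermined length), and every function name is hypothetical. Because both partial strings have the same length, the one assigned the smaller distance is taken as the one with fewer replacement strings.

```python
from itertools import combinations, product

ALPHABET = "ACGU"  # assumed alphabet for this sketch

def build_index(text, k):
    """Map every length-k substring of text to its start positions."""
    index = {}
    for i in range(len(text) - k + 1):
        index.setdefault(text[i:i + k], []).append(i)
    return index

def variants(s, e, alphabet=ALPHABET):
    """Yield every string at Hamming distance exactly e from s."""
    for positions in combinations(range(len(s)), e):
        choices = [[c for c in alphabet if c != s[p]] for p in positions]
        for repl in product(*choices):
            chars = list(s)
            for p, c in zip(positions, repl):
                chars[p] = c
            yield "".join(chars)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def search_similar(text, index, k, query, d):
    """Return sorted (start, string, distance) for windows within distance d."""
    n = len(query)
    if n < 2 * k:
        raise ValueError("this sketch needs non-overlapping partial strings")
    left, right = query[:k], query[n - k:]
    # Assignment and selection: for each split of the distance between the two
    # partial strings, keep only the side that was assigned the smaller share.
    jobs = set()
    for d1 in range(d + 1):
        for d2 in range(d + 1 - d1):
            jobs.add(("left", d1) if d1 <= d2 else ("right", d2))
    hits = set()
    for side, e in jobs:
        part, offset = (left, 0) if side == "left" else (right, n - k)
        for key in variants(part, e):          # replacement strings
            for pos in index.get(key, ()):     # index lookup
                start = pos - offset
                if start < 0 or start + n > len(text):
                    continue
                cand = text[start:start + n]
                dist = hamming(cand, query)    # final determination
                if dist <= d:
                    hits.add((start, cand, dist))
    return sorted(hits)
```

Any window within distance `d` has at most `d // 2` mismatches in at least one of the two partial strings, which is why searching only the side with the smaller share does not miss it.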
- the technique of the present invention can be used for similarity search of peptide sequences, that is, for searching for peptides similar to the input peptide sequence.
- in Embodiment 10 of the present invention, an embodiment will be described in which the base sequence search device of any of Embodiments 1 to 8 is improved with respect to the search for repeat sequences.
- FIG. 19 illustrates a functional block diagram of the base sequence search device according to Embodiment 10 of the present invention.
- the base sequence search device according to any one of Embodiments 1 to 8 has a repeat sequence storage unit 1901 and a repeat sequence information storage unit 1902, and the search unit 407 It has a repeat sequence determining means 1903 and a repeat sequence search means 1904.
- FIG. 19 is a functional block diagram when the base sequence search device according to the first embodiment includes these units and means.
- the "repeat sequence accumulation unit" 1901 accumulates the nucleotide sequence of the predetermined length repeatedly appearing in the gene nucleotide sequence.
- the "predetermined length" is a value determined by the index used by the base sequence search device: it is the length of base sequences for which the index can be used to find the positions at which they appear in the gene base sequence.
- Fig. 20 illustrates a state where a nucleotide sequence that repeatedly appears in a gene nucleotide sequence is stored in a table.
- identifiers that uniquely identify the nucleotide sequences that repeatedly appear in the gene nucleotide sequence are stored on the same row as those nucleotide sequences. In this way, identifiers and nucleotide sequences are stored in the table in association with each other.
- Repeat sequence information storage unit 1902 stores repeat sequence information.
- the repeat sequence information is information in which a base sequence stored in the repeat sequence storage unit 1901 is associated with an appearance position of the base sequence in the gene sequence.
- FIG. 21 illustrates a table for storing repeat sequence information.
- the table stores the identifier used in the table of FIG. 20 and the position where the nucleotide sequence appears in the gene nucleotide sequence. Storing them in the same row associates them with each other.
- the identifier is stored in the column named "repeat sequence identifier", and the position where the nucleotide sequence appears in the gene nucleotide sequence is stored in the column named "appearance position".
- Repeat sequence determination means 1903 determines whether or not the replacement base sequence generated by the replacement base sequence generation unit 406 is stored in the repeat sequence storage unit 1901. For example, it checks whether or not the replacement base sequence is stored in the column named "repeat sequence" in the table of FIG. 20. This check can be performed at high speed by using an index (for example, one built as a B+ tree) whose keys are the base sequences stored in the column named "repeat sequence" and whose values are the identifiers stored in the column named "repeat sequence identifier".
- the base sequence determined by the repeat sequence determining means 1903 to be stored in the repeat sequence storage unit 1901 is referred to as a repeat sequence.
- when the repeat sequence determining means 1903 determines that the replacement base sequence is stored in the repeat sequence storage unit 1901, the "repeat sequence search means" 1904 performs a search based on the repeat sequence information stored in the repeat sequence information storage unit 1902.
- for example, the identifier stored in the "repeat sequence identifier" column is obtained from the table in FIG. 20, the appearance positions are obtained from the table in FIG. 21, and the base sequences before and after each appearance position in the gene base sequence are obtained.
- the search is then performed by determining whether each such base sequence is within the predetermined Hamming distance from the input base sequence.
- FIG. 22 exemplifies a flowchart for explaining the flow of processing in the search unit of the base sequence search device of FIG. 19 according to the present embodiment.
- the repeat sequence determining means determines whether or not the replacement base sequence is a repeat sequence. If it is a repeat sequence (that is, on the YES branch of step S2201), the process proceeds to step S2202, and the repeat sequence search means 1904 performs a search based on the repeat sequence information. If it is not a repeat sequence (that is, on the NO branch of step S2201), the process proceeds to step S2203, and a similar base sequence is searched for as in Embodiments 1 to 8. Alternatively, no search may be performed when the sequence is a repeat sequence, so that a search is performed only when the sequence is determined not to be a repeat sequence.
- (Embodiment 10: Main effects)
- when the replacement base sequence is a repeat sequence, performing search processing dedicated to repeat sequences prevents the search speed from dropping because of repeat sequences.
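The branch between the repeat sequence tables and the ordinary index can be sketched as follows. Two Python dictionaries stand in for the tables of FIG. 20 and FIG. 21; this is a substitute for the B+ tree index mentioned in the text, chosen only to keep the example short. The function and parameter names are hypothetical.

```python
def search_replacement(key, repeat_ids, repeat_positions, index_lookup):
    """Return the positions at which the replacement sequence `key` appears.

    repeat_ids       -- {repeat sequence: repeat sequence identifier}
    repeat_positions -- {repeat sequence identifier: [appearance positions]}
    index_lookup     -- callable used for sequences that are not repeats
    """
    repeat_id = repeat_ids.get(key)            # S2201: is it a repeat sequence?
    if repeat_id is not None:
        return repeat_positions[repeat_id]     # S2202: use repeat sequence info
    return index_lookup(key)                   # S2203: ordinary index search
```

The positions returned on either branch would then be checked against the input base sequence and the Hamming distance limit as in the earlier embodiments.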
- in Embodiment 11 of the present invention, a base sequence search device that accumulates search results for similar base sequences will be described.
- FIG. 23 illustrates a functional block diagram of a base sequence search device according to Embodiment 11 of the present invention.
- the base sequence search device according to the present embodiment has a configuration in which the base sequence search device according to any one of Embodiments 4 to 7 includes a similar base sequence accumulation unit 2301.
- FIG. 23 is a functional block diagram when the base sequence search device according to the fourth embodiment has a similar base sequence accumulation unit 2301.
- when the determination unit 1402 determines that the Hamming distance between the input base sequence and the similar candidate base sequence acquired by the similar candidate base sequence acquisition unit 1401 is less than or equal to the Hamming distance input to the Hamming distance input unit 402, the "similar base sequence accumulation unit" 2301 stores (1) the input base sequence, (2) the Hamming distance between the input base sequence and its similar base sequence, and (3) the similar base sequence in association with each other.
- FIG. 24 shows an example of data that stores (1) the input base sequence, (2) the Hamming distance between the input base sequence and the similar base sequence, and (3) the similar base sequence in association with each other.
- FIG. 25 exemplifies a flowchart for explaining the processing flow of the determination unit and the similar base sequence accumulation unit of the base sequence search device according to the present embodiment.
- the determination unit determines whether the Hamming distance between the input base sequence and the similar base sequence is less than or equal to the input Hamming distance. If so, the process takes the YES branch of step S2501.
- in step S2502, (1) the input base sequence, (2) the Hamming distance, and (3) the similar base sequence are stored in the similar base sequence accumulation unit 2301 in association with each other.
- otherwise, step S2502 is not executed.
- because the search results of the base sequence search device are stored in the similar base sequence accumulation unit 2301, whether a search has already been performed for the same input base sequence and the same Hamming distance can be determined by looking up the information stored in the similar base sequence accumulation unit 2301, so similar base sequences can be searched for efficiently.
- the base sequence search apparatus according to the present embodiment is particularly useful, for example, when providing search services to a large number of people via the Internet or the like. For example, if a first person performs a search and a second person then performs the same search, the search results provided to the first person can be reused for the second person, which shortens the response time and reduces the load on the base sequence search device.
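Reusing accumulated results for a repeated search can be sketched as a small cache keyed by the input base sequence and the Hamming distance. The class name, the in-memory dictionary, and the `search_fn` callable are assumptions for illustration; the text does not specify how the accumulation unit is stored.

```python
class SimilarSequenceStore:
    """Keeps (similar sequence, distance) results per (input, max distance)."""

    def __init__(self, search_fn):
        self._search_fn = search_fn   # hypothetical: performs the real search
        self._results = {}

    def search(self, query, max_distance):
        key = (query, max_distance)
        if key not in self._results:  # not searched before: run and store
            self._results[key] = list(self._search_fn(query, max_distance))
        return self._results[key]     # otherwise reuse the stored results
```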
- the "association rate" is a value that indicates the proportion of two types of base sequences that bind to each other when they are placed in a fluid environment such as a liquid.
- Such a value can be calculated by performing a physicochemical calculation from the base sequence. For example, the calculation method is disclosed in the document cited as Non-Patent Document 1 described above.
- FIG. 26 illustrates a functional block diagram of a base sequence search device according to Embodiment 12 of the present invention.
- the base sequence search device according to the present embodiment has a configuration in which any one of the base sequence search devices according to Embodiments 4 to 7 includes an association rate calculation unit 2601.
- FIG. 26 is a functional block diagram in the case where the base sequence search device according to the fourth embodiment includes an association rate calculation unit 2601.
- when it is determined that the Hamming distance between the similar candidate base sequence acquired by the similar candidate base sequence acquisition unit 1401 and the input base sequence input by the base sequence input unit 401 is less than or equal to the Hamming distance input to the Hamming distance input unit 402, the "association rate calculation unit" 2601 calculates the association rate between (1) the input base sequence input by the base sequence input unit 401 and (2) the similar candidate base sequence obtained by the similar candidate base sequence acquisition unit 1401. For example, conditions such as liquid temperature and pH are set beforehand, and the association rate under those conditions is calculated physicochemically. When calculating the association rate, the bases constituting the input base sequence or the bases constituting the similar candidate base sequence are replaced with complementary bases.
- the base sequence search device of the present invention can efficiently search for base sequences whose Hamming distance from the input base sequence is less than or equal to a predetermined value, and the calculated association rate can be compared with the results of actual wet experiments and used to predict the effect of drugs that use RNA interference.
- Embodiment 13 of the present invention describes an apparatus for searching for a base sequence that can be used as a control in a wet experiment or the like.
- FIG. 27 illustrates a functional block diagram of an ineffective base sequence generator according to Embodiment 13 of the present invention.
- the ineffective base sequence generator 2700 includes a base sequence acquisition unit 2701, an ineffective candidate replacement base sequence generation unit 2702, an ineffective candidate replacement base sequence input unit 2703, a second Hamming distance input unit 2704, and a selection unit 2705.
- Base sequence obtaining unit 2701 obtains a base sequence having a length exceeding the predetermined length.
- the "predetermined length" is, as described in Embodiment 10, a value determined by the index used by the base sequence search device according to any of Embodiments 4 to 7: it is the length of base sequences for which the index can be used to find the positions at which they appear in the gene base sequence.
- the base sequence acquisition unit is connected to, for example, a client device via a communication network, and acquires a base sequence input to a web browser or the like operating on the client device.
- the base sequence obtained by the base sequence obtaining unit 2701 is, for example, a base sequence that has been found not to function on the target mRNA.
- the "ineffective candidate replacement nucleotide sequence generator” 2702 generates an ineffective candidate replacement nucleotide sequence.
- the "ineffective candidate substitution base sequence" is a base sequence obtained by substituting a predetermined number of bases in the base sequence obtained by the base sequence obtaining unit. For example, if the base sequence length is 21 and the predetermined number is 3, there are $(4-1)^{3} \times {}_{21}C_{3}$ ineffective candidate
- substitution base sequences (the 4 in "4-1" is the number of base types).
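For a base sequence of length 21 with 3 substituted bases and 4 base types, the number of ineffective candidate substitution base sequences is C(21, 3) × (4 − 1)³ = 1330 × 27 = 35910. The sketch below computes the count and enumerates the candidates. The function names and the choice of the RNA alphabet `ACGU` are assumptions made for the example.

```python
from itertools import combinations, product
from math import comb

BASES = "ACGU"  # assumed base alphabet

def count_substituted(length, substitutions, n_bases=len(BASES)):
    """Number of sequences with exactly `substitutions` bases replaced."""
    return comb(length, substitutions) * (n_bases - 1) ** substitutions

def substituted_sequences(seq, substitutions, bases=BASES):
    """Yield every sequence that differs from seq at exactly that many positions."""
    for positions in combinations(range(len(seq)), substitutions):
        alternatives = [[b for b in bases if b != seq[p]] for p in positions]
        for replacement in product(*alternatives):
            chars = list(seq)
            for p, b in zip(positions, replacement):
                chars[p] = b
            yield "".join(chars)
```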
- instead of generating all ineffective candidate substitution base sequences, only base sequences predicted to have a low association rate with the target mRNA base sequence may be generated, based on expert knowledge.
- an ineffective candidate substitution base sequence may be generated using a sequence that appears only a small number of times.
- the "ineffective candidate replacement base sequence input unit" 2703 inputs the ineffective candidate replacement base sequence generated by the ineffective candidate replacement base sequence generation unit 2702 to the base sequence search device 2706 according to Embodiment 12. For example, if the ineffective base sequence generation device and the base sequence search device according to the twelfth embodiment are connected by a LAN or the like, information indicating the ineffective candidate replacement base sequence is sent to the base sequence search device according to the twelfth embodiment.
- the "second Hamming distance input unit" 2704 inputs a predetermined Hamming distance to the base sequence search device 2706 to which the ineffective candidate replacement base sequence input unit 2703 has input the ineffective candidate replacement base sequence. For example, the predetermined Hamming distance is input when the ineffective candidate substitution base sequence input unit 2703 inputs the ineffective candidate substitution base sequence.
- the "selection unit" 2705 selects a base sequence with a low association rate, using the association rates obtained from the base sequence search device 2706 in response to the input of the ineffective candidate substitution base sequence input unit 2703 and the input of the second Hamming distance input unit 2704. For example, if the association rate between one ineffective candidate substitution base sequence and a base sequence similar to it is 50%, and the association rate between another ineffective candidate substitution base sequence and a base sequence similar to it is 10%, the latter ineffective candidate substitution base sequence is selected and displayed to the user of the ineffective base sequence generator as a base sequence with no effect.
- FIG. 28 is a flowchart illustrating the processing flow of the ineffective base sequence generator according to the present embodiment.
- the base sequence is obtained by the base sequence obtaining unit 2701.
- an ineffective candidate substitution base sequence is generated by the ineffective candidate substitution base sequence generation unit 2702.
- the ineffective candidate substitution base sequence and a predetermined Hamming distance are input to the base sequence search device 2706.
- Step S2803 is performed once for each ineffective candidate replacement nucleotide sequence, and an association rate is obtained for each ineffective candidate replacement nucleotide sequence.
- an ineffective candidate substitution base sequence having a low association rate is selected by the selection unit 2705.
- since the nucleotide sequence obtained by the selection is presumed to be a nucleotide sequence having no effect, it can be used as a control in a wet experiment.
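The selection step can be sketched as follows. The text only says that a candidate with a low association rate is selected; ranking candidates by their highest association rate with any similar sequence is an assumption made here, as are the function name and the `association_rates` callable standing in for the base sequence search device.

```python
def select_ineffective(candidates, association_rates):
    """Pick the candidate whose highest association rate is lowest.

    association_rates(candidate) returns the association rates (0.0 to 1.0)
    between the candidate and each similar sequence found for it.
    """
    def worst_rate(candidate):
        # A candidate with no similar sequences gets 0.0.
        return max(association_rates(candidate), default=0.0)

    return min(candidates, key=worst_rate)
```

With one candidate at 50% and another at 10%, as in the example in the text, the second candidate is returned.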
- in Embodiment 14 of the present invention, an apparatus for performing base sequence alignment using the base sequence search apparatus of the present invention will be described.
- FIG. 29 is a diagram for explaining an outline of a process performed by the device according to the fourteenth embodiment of the present invention.
- suppose there is a gene base sequence 2901, and it is desired to know in which part of this sequence a base sequence similar to the base sequence 2902 exists.
- a partial sequence 2903 of the base sequence 2902 is obtained.
- the length of the partial sequence 2903 is a length suitable for the nucleotide sequence search apparatus of the present invention, and is preferably 15 to 25.
- a similar nucleotide sequence 2904 of the partial sequence 2903 is found in the gene nucleotide sequence 2901.
- the base sequences before and after the partial sequence 2903 and before and after the similar base sequence 2904 are compared using a conventionally known method such as dynamic programming.
- FIG. 30 illustrates a functional block diagram of a base sequence alignment apparatus according to Embodiment 14 of the present invention.
- the base sequence alignment apparatus 3000 has a second base sequence acquisition unit 3001, a partial base sequence selection unit 3002, a partial base sequence input unit 3003, a third Hamming distance input unit 3004, and an alignment unit 3005.
- “Second nucleotide sequence acquisition unit” 3001 acquires a nucleotide sequence exceeding the predetermined length.
- "Partial nucleotide sequence selection unit" 3002 selects a partial nucleotide sequence that is a part of the nucleotide sequence acquired by second nucleotide sequence acquisition unit 3001. For example, a base sequence having a length of 15 to 25 is selected from the base sequence obtained by the second base sequence obtaining unit 3001. It is desirable that the selected partial base sequence is not a repeat sequence as described in Embodiment 10, because otherwise a large number of alignment candidates are found and step S3104 described later must be executed many times. Therefore, as in Embodiment 10, a repeat sequence storage unit may be provided in the base sequence alignment device, and the partial base sequence may be selected by referring to the content stored in the repeat sequence storage unit.
- the "partial base sequence input unit" 3003 inputs the partial base sequence selected by the partial base sequence selection unit to the base sequence search device 3006 according to any one of Embodiments 4 to 8.
- the "third Hamming distance input unit" 3004 inputs a predetermined Hamming distance to the base sequence search device 3006 to which the partial base sequence input unit has input the partial base sequence.
- a similar base sequence of the partial base sequence is obtained, and the position in the gene base sequence is obtained.
- the "alignment unit" 3005 aligns the base sequence obtained by the second base sequence obtaining unit 3001 with the gene base sequence, based on the search result obtained from the base sequence search device 3006 in response to the input by the partial base sequence input unit 3003 and the input by the third Hamming distance input unit 3004. For example, assuming that the partial base sequence is the portion indicated by reference numeral 2903 and that the base sequence search device 3006 finds that the base sequence similar to the partial base sequence is the portion indicated by reference numeral 2904, a score value indicating how similar the base sequences before and after the portion 2904 are to the base sequence indicated by reference numeral 2902 is calculated using dynamic programming or the like.
- FIG. 31 is a flowchart illustrating a processing flow of the base sequence alignment apparatus of FIG. 30 according to the present embodiment.
- the second nucleotide sequence acquisition unit 3001 acquires a nucleotide sequence.
- the partial base sequence selection unit 3002 selects a partial base sequence.
- the partial base sequence and the Hamming distance are input to the base sequence search device 3006 by the partial base sequence input unit 3003 and the third Hamming distance input unit 3004.
- the base sequence is aligned with the gene base sequence based on the search result by base sequence search device 3006.
- Step S3104 is executed repeatedly, once for each search result obtained in step S3103.
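The flow of selecting a partial base sequence, searching for similar occurrences, and scoring the surrounding bases can be sketched as below. Several parts are assumptions for illustration: a brute-force scan stands in for the base sequence search device 3006, the score is an edit distance computed by dynamic programming (lower is more similar), the gene flanks are taken with the same length as the query flanks, and all names are hypothetical.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def find_seed_hits(gene, seed, max_distance):
    """Stand-in for the search device: brute-force scan of the gene."""
    k = len(seed)
    return [i for i in range(len(gene) - k + 1)
            if hamming(gene[i:i + k], seed) <= max_distance]

def edit_distance(a, b):
    """Classic dynamic programming (Levenshtein) distance."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # match or substitution
        prev = cur
    return prev[-1]

def align(gene, query, seed_start, seed_len, max_distance):
    """Return (seed hit position, score) pairs, lowest score first."""
    seed = query[seed_start:seed_start + seed_len]
    left_q = query[:seed_start]
    right_q = query[seed_start + seed_len:]
    results = []
    for hit in find_seed_hits(gene, seed, max_distance):
        # Gene flanks around the hit, clipped at the ends of the gene.
        left_g = gene[max(0, hit - len(left_q)):hit]
        right_g = gene[hit + seed_len:hit + seed_len + len(right_q)]
        score = (edit_distance(left_q, left_g)
                 + hamming(seed, gene[hit:hit + seed_len])
                 + edit_distance(right_q, right_g))
        results.append((hit, score))
    return sorted(results, key=lambda r: r[1])
```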
- conventionally, BLAST and similar tools have been used. With BLAST, for example, the places where similar base sequences appear in the gene base sequence are found by searching for base sequences in which a consecutive 7-mer matches, so accurate alignment was difficult in some cases. In the present invention, a similar base sequence of the partial base sequence is searched for, so more accurate alignment can be performed.
- the base sequence search device and the base sequence search method according to the present invention reduce the amount of calculation required for the search and do not overlook the existence of similar base sequences whose Hamming distance is less than or equal to a predetermined value, so they are useful for designing base sequences and the like.
- various predetermined guidelines specifically, it is possible to design an siRNA having a high RNA interference (RNAi) effect
- RNAi: RNA interference
- FIG. 1 Schematic representation of the process of RNA interference
- FIG. 17 Example of a program for dividing and assigning the Hamming distance
- FIG. 22 is a flowchart of a process performed by a search unit of the base sequence search device according to the tenth embodiment of the present invention.
- FIG. 28 is a flowchart of a process performed by the ineffective base sequence generator according to Embodiment 13 of the present invention.
- FIG. 29 is a schematic diagram of processing by an apparatus according to Embodiment 14 of the present invention.
- FIG. 30 is a functional block diagram of a base sequence alignment apparatus according to Embodiment 14 of the present invention.
- FIG. 31 is a flowchart of a process performed by the nucleotide sequence alignment apparatus according to the fourteenth embodiment of the present invention.
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Medical Informatics (AREA)
- Biophysics (AREA)
- Theoretical Computer Science (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Chemical & Material Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Analytical Chemistry (AREA)
- Molecular Biology (AREA)
- Genetics & Genomics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2006511830A JP4614949B2 (ja) | 2004-03-31 | 2005-03-31 | Base sequence search device and base sequence search method |
US10/594,644 US20080263002A1 (en) | 2004-03-31 | 2005-03-31 | Base Sequence Retrieval Apparatus |
EP05727509A EP1732022A4 (en) | 2004-03-31 | 2005-03-31 | APPARATUS FOR RECOVERING A BASIC SEQUENCE |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2004-108456 | 2004-03-31 | ||
JP2004108456 | 2004-03-31 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2005096208A1 (ja) | 2005-10-13 |
Family
ID=35063999
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2005/006397 WO2005096208A1 (ja) | 2004-03-31 | 2005-03-31 | Base sequence search device and base sequence search method |
Country Status (4)
Country | Link |
---|---|
US (1) | US20080263002A1 (ja) |
EP (1) | EP1732022A4 (ja) |
JP (1) | JP4614949B2 (ja) |
WO (1) | WO2005096208A1 (ja) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108140071A (zh) * | 2015-10-21 | 2018-06-08 | Coherent Logix, Incorporated | DNA alignment using a hierarchical inverted index table |
US11222712B2 (en) | 2017-05-12 | 2022-01-11 | Noblis, Inc. | Primer design using indexed genomic information |
US11308056B2 (en) | 2013-05-29 | 2022-04-19 | Noblis, Inc. | Systems and methods for SNP analysis and genome sequencing |
WO2022244089A1 (ja) | 2021-05-18 | 2022-11-24 | Fujitsu Limited | Information processing program, information processing method, and information processing device |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101482011B1 (ko) * | 2012-10-29 | 2015-01-14 | Samsung SDS Co., Ltd. | System and method for aligning base sequences |
KR101508816B1 (ko) * | 2012-10-29 | 2015-04-07 | Samsung SDS Co., Ltd. | System and method for aligning base sequences |
2005
- 2005-03-31 US US10/594,644 patent/US20080263002A1/en not_active Abandoned
- 2005-03-31 JP JP2006511830A patent/JP4614949B2/ja active Active
- 2005-03-31 EP EP05727509A patent/EP1732022A4/en not_active Withdrawn
- 2005-03-31 WO PCT/JP2005/006397 patent/WO2005096208A1/ja active Application Filing
Non-Patent Citations (4)
Title |
---|
LI M. ET AL: "Finding Similar Regions In Many Strings.", PROC.ANNU. ACM SYMP. THEORY COMPUT., vol. 31, 1999, pages 473 - 482, XP002989474 * |
NAVARRO G.A. ET AL: "Guided Tour to Approximate String Matching.", ACM COMPUTING SURVEYS., vol. 33, no. 1, 2001, pages 31 - 88, XP002235679 * |
See also references of EP1732022A4 * |
UI-TEI K. ET AL: "Guidelines for the selection of highly effective siRNA sequences for mammalian and chick RNA interference.", NUCLEIC ACIDS RESEARCH., vol. 32, no. 3, 9 February 2004 (2004-02-09), pages 936 - 948, XP002329955 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11308056B2 (en) | 2013-05-29 | 2022-04-19 | Noblis, Inc. | Systems and methods for SNP analysis and genome sequencing |
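CN108140071A (zh) * | 2015-10-21 | 2018-06-08 | Coherent Logix, Incorporated | DNA alignment using a hierarchical inverted index table |
JP2018535484A (ja) * | 2015-10-21 | 2018-11-29 | Coherent Logix, Incorporated | DNA alignment using a hierarchical inverted index table |
CN108140071B (zh) * | 2015-10-21 | 2022-04-29 | Coherent Logix, Incorporated | DNA alignment using a hierarchical inverted index table |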
US11594301B2 (en) | 2015-10-21 | 2023-02-28 | Coherent Logix, Incorporated | DNA alignment using a hierarchical inverted index table |
US11222712B2 (en) | 2017-05-12 | 2022-01-11 | Noblis, Inc. | Primer design using indexed genomic information |
WO2022244089A1 (ja) | 2021-05-18 | 2022-11-24 | Fujitsu Limited | Information processing program, information processing method, and information processing device |
Also Published As
Publication number | Publication date |
---|---|
JPWO2005096208A1 (ja) | 2008-02-21 |
EP1732022A4 (en) | 2008-09-24 |
US20080263002A1 (en) | 2008-10-23 |
EP1732022A1 (en) | 2006-12-13 |
JP4614949B2 (ja) | 2011-01-19 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US8178503B2 (en) | Ribonucleic acid interference molecules and binding sites derived by analyzing intergenic and intronic regions of genomes | |
Schbath et al. | Mapping reads on a genomic sequence: an algorithmic overview and a practical comparative analysis | |
Heyne et al. | GraphClust: alignment-free structural clustering of local RNA secondary structures | |
Grover et al. | Searching microsatellites in DNA sequences: approaches used and tools developed | |
Rahn et al. | Journaled string tree—a scalable data structure for analyzing thousands of similar genomes on your laptop | |
WO2005096208A1 (ja) | Base sequence search device and base sequence search method | |
Frid et al. | A simple, practical and complete O-time Algorithm for RNA folding using the Four-Russians Speedup | |
Wang et al. | A steganalysis-based approach to comprehensive identification and characterization of functional regulatory elements | |
US8065091B2 (en) | Techniques for linking non-coding and gene-coding deoxyribonucleic acid sequences and applications thereof | |
Wienbrandt et al. | Using the reconfigurable massively parallel architecture COPACOBANA 5000 for applications in bioinformatics | |
Frid et al. | An improved Four-Russians method and sparsified Four-Russians algorithm for RNA folding | |
Subramaniyan et al. | Accelerating maximal-exact-match seeding with enumerated radix trees | |
US20200265923A1 (en) | Efficient Seeding For Read Alignment | |
JP2003256433A (ja) | Gene structure analysis method and apparatus therefor | |
JP4991287B2 (ja) | Specific base sequence search method | |
Martin et al. | Fast and accurate genome-scale identification of DNA-binding sites | |
JP7393439B2 (ja) | Gene sequencing data processing method and gene sequencing data processing device | |
Biswas et al. | PR2S2Clust: patched rna-seq read segments’ structure-oriented clustering | |
Aguena et al. | A Survey on Solutions for Planted Motif Search Challenging Instances | |
Kamarudin et al. | A Review of Bioinformatics Model and Computational Software of Next Generation Sequencing | |
WO2023021205A1 (en) | Computer-implemented methods and systems for transcriptomics | |
Khan et al. | AI and Genomes for Decisions Regarding the Expression of Genes | |
Chang et al. | The application of alternative splicing graphs in quantitative analysis of alternative splicing form from EST database | |
Zhao et al. | Identifying TF Binding Motifs from a Partial Set of Target Genes and its Application to Regulatory Network Inference | |
Sarje et al. | Parallel algorithms for alignments on the cell be |
Legal Events
Code | Title | Description |
---|---|---|
AK | Designated states | Kind code of ref document: A1; Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
AL | Designated countries for regional patents | Kind code of ref document: A1; Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | |
WWE | WIPO information: entry into national phase | Ref document number: 2006511830; Country of ref document: JP |
NENP | Non-entry into the national phase | Ref country code: DE |
WWW | WIPO information: withdrawn in national office | Country of ref document: DE |
WWE | WIPO information: entry into national phase | Ref document number: 2005727509; Country of ref document: EP |
WWP | WIPO information: published in national office | Ref document number: 2005727509; Country of ref document: EP |
WWE | WIPO information: entry into national phase | Ref document number: 10594644; Country of ref document: US |