WO2019022019A1 - Method for detecting insertion, deletion, inversion, translocation, and substitution - Google Patents

Method for detecting insertion, deletion, inversion, translocation, and substitution

Publication number
WO2019022019A1
WO2019022019A1 PCT/JP2018/027536 JP2018027536W WO2019022019A1 WO 2019022019 A1 WO2019022019 A1 WO 2019022019A1 JP 2018027536 W JP2018027536 W JP 2018027536W WO 2019022019 A1 WO2019022019 A1 WO 2019022019A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
data
partial
sequence data
target
Prior art date
Application number
PCT/JP2018/027536
Other languages
English (en)
Japanese (ja)
Inventor
安藝雄 宮尾
Original Assignee
国立研究開発法人農業・食品産業技術総合研究機構
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 国立研究開発法人農業・食品産業技術総合研究機構
Priority to JP2019532604A (patent JP7122006B2)
Publication of WO2019022019A1

Classifications

    • C - CHEMISTRY; METALLURGY
    • C12 - BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12M - APPARATUS FOR ENZYMOLOGY OR MICROBIOLOGY; APPARATUS FOR CULTURING MICROORGANISMS FOR PRODUCING BIOMASS, FOR GROWING CELLS OR FOR OBTAINING FERMENTATION OR METABOLIC PRODUCTS, i.e. BIOREACTORS OR FERMENTERS
    • C12M1/00 - Apparatus for enzymology or microbiology
    • C - CHEMISTRY; METALLURGY
    • C12 - BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N - MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00 - Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09 - Recombinant DNA-technology
    • C - CHEMISTRY; METALLURGY
    • C12 - BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q - MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6813 - Hybridisation assays
    • C12Q1/6827 - Hybridisation assays for detection of mutation or polymorphism

Definitions

  • the present invention relates to the field of information processing of sequence information, especially sequence information of biomolecules such as genomes.
  • With next-generation sequencers, whole-genome sequence information of organisms has become obtainable.
  • Extracting polymorphism information from next-generation sequencer data and examining its association with a phenotype leads to the identification of the gene responsible for that phenotype.
  • Acquisition of accurate polymorphism information is a basic technology required in a wide range of fields, such as crop breeding, diagnosis of human genetic diseases, and identification of species and varieties. If accurate polymorphism information can be obtained, the impact is large.
  • Polymorphism detection using nucleotide sequence data from the next-generation sequencer is performed by first obtaining the positional information and mismatch information on the reference sequence using a mapping program such as bwa or bowtie for the sequence data.
  • polymorphism information such as SNP and indel is extracted by polymorphism extraction programs such as Samtools and GATK.
  • methods are provided for detecting polymorphisms between two or more sequences.
  • One feature of the methods of the invention is that they can detect polymorphisms without the need to link individual sequences in the sequence data (e.g., short reads from next-generation sequencers) into longer sequences (e.g., by assembly).
  • the method of the invention is characterized in that a partial sequence of the target sequence that matches a sequence (for example, the reference genome) is extended, by comparing the target sequence with the reference, up to the position where a mismatch occurs; this determines the junction of the mutation.
  • Further features of the present invention are that it can detect large deletions, inversions, and translocations that were difficult to detect by conventional methods, and that the detection results themselves can be visually confirmed as an alignment.
  • a method for detecting polymorphism in target sequence data relative to control sequence data, comprising: a) identifying the positions on the control sequence of at least two partial sequences in the sequence of the target sequence data; b) comparing the positional relationship between the partial sequences in the target sequence data with the positional relationship between the partial sequences on the control sequence; c) when the two positional relationships differ, determining that a polymorphism of interest is present, and sequentially comparing the characters between the partial sequence sites in the target sequence data with the characters on the corresponding control sequence, starting from the partial sequence sites, to detect non-matching sites.
  • the target sequence data and the control sequence data are base sequence data.
  • the method according to any of the items, wherein the target sequence data is sequence data obtained by next-generation sequencing.
  • the polymorphism is an insertion, a deletion, an inversion, a translocation, or a substitution.
  • the step of determining that a polymorphism of interest is present comprises one or more of: determining that a translocation has occurred if the partial sequences are present on different sequence structures of the control sequence; determining that an inversion is present if the partial sequences are present on the same sequence structure of the control sequence and the orientation differs from that in the target sequence data;
  • determining that a deletion is present if the partial sequences are present on the same sequence structure of the control sequence, the orientation is identical to that in the target sequence data, and the distance between the partial sequences in the target sequence data is shorter than the distance on the control sequence; and/or
  • determining that an insertion is present if the partial sequences are present on the same sequence structure of the control sequence, the orientation is identical to that in the target sequence data, and the distance between the partial sequences in the target sequence data is longer than the distance on the control sequence.
  • (Item 7) The method according to any of the preceding items, comprising determining that no polymorphism of interest is present if the positional relationship between the partial sequences in the target sequence data does not differ from the positional relationship between the partial sequences on the control sequence.
  • (Item 8) The method according to any one of the above items, further comprising, when the two positional relationships do not differ, the step of detecting a non-matching site by comparing the characters between the partial sequence sites in the target sequence data with the characters on the corresponding control sequence, and determining that a substitution is present if a non-matching site is present.
  • (Item 9) The method according to any of the above items, wherein the step of detecting a non-matching site by sequentially comparing the characters between the partial sequence sites in the target sequence data with the characters on the corresponding control sequence, starting from the partial sequence sites, comprises: searching upstream for matching characters from the downstream partial sequence site in the target sequence data until a character that does not match the character at the corresponding position in the control sequence is detected; and searching downstream for matching characters from the upstream partial sequence site in the target sequence data until a character that does not match the character at the corresponding position in the control sequence is detected.
  • searching for the matching character is a search for each character.
  • searching for the matching character comprises: when a first non-matching character is detected, identifying that character as the boundary of the polymorphism if 40% or more of the subsequent 2 to 10 characters do not match; and otherwise ignoring the non-matching character and continuing the search for matching characters.
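The boundary rule in the preceding item can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `find_boundary`, its parameters, and the handling of a mismatch that has no following characters (it is ignored here) are assumptions.

```python
def find_boundary(target, control, t_start, c_start, step=1, window=10, threshold=0.4):
    """Walk character by character from an anchored position.

    Returns the number of characters accepted before the polymorphism
    boundary, or None if either sequence ends first.
    step=1 searches downstream, step=-1 searches upstream.
    """
    if not 2 <= window <= 10:
        raise ValueError("window must be between 2 and 10 characters")
    offset = 0
    while True:
        t = t_start + step * offset
        c = c_start + step * offset
        if not (0 <= t < len(target) and 0 <= c < len(control)):
            return None  # ran off the end without calling a boundary
        if target[t] != control[c]:
            # Look at the following characters before deciding.
            compared = mismatched = 0
            for j in range(1, window + 1):
                tj, cj = t + step * j, c + step * j
                if not (0 <= tj < len(target) and 0 <= cj < len(control)):
                    break
                compared += 1
                mismatched += target[tj] != control[cj]
            # Assumption: a mismatch with nothing after it is ignored.
            if compared and mismatched / compared >= threshold:
                return offset
        offset += 1


control = "ACGT" * 6
# One isolated substitution at index 5, then divergence from index 18 onward.
target = control[:5] + "G" + control[6:18] + "TTTTTT"
boundary = find_boundary(target, control, 0, 0)
```

The isolated substitution at index 5 is skipped because none of the following ten characters mismatch, while the run starting at index 18 is called as the boundary.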
  • a program for causing a computer to execute a method of detecting polymorphism in target sequence data relative to control sequence data, the method comprising: a) storing target sequence data and control sequence data in a computer; b) identifying the positions on the control sequence of at least two partial sequences in the sequence of the target sequence data; c) comparing the positional relationship between the partial sequences in the target sequence data with the positional relationship between the partial sequences on the control sequence; d) when the two positional relationships differ, determining that a polymorphism of interest is present, and sequentially comparing characters on the
  • a recording medium storing a program for causing a computer to execute a method of detecting polymorphism in target sequence data relative to control sequence data, wherein the method comprises: a) storing target sequence data and control sequence data in a computer; b) identifying the positions on the control sequence of at least two partial sequences in the sequence of the target sequence data; c) comparing the positional relationship between the partial sequences in the target sequence data with the positional relationship between the partial sequences on the control sequence; d) when the two positional relationships differ, determining that a polymorphism of interest is present, and sequentially comparing the characters between the partial sequence sites in the target sequence data with the characters on the corresponding control sequence, starting from the partial sequence sites, to detect non-matching sites.
  • a recording medium according to the item, having the features described in any one or more of the items.
  • a system for detecting polymorphism in target sequence data relative to control sequence data, comprising: a sequence data providing unit configured to provide a computer with target sequence data and control sequence data; and a sequence data calculation unit configured to perform the steps of a) identifying the positions on the control sequence of at least two partial sequences in the sequence of the target sequence data; b) comparing the positional relationship between the partial sequences in the target sequence data with the positional relationship between the partial sequences on the control sequence; and c) when the two positional relationships differ, determining that a polymorphism of interest is present, and sequentially comparing the characters between the partial sequence sites in the target sequence data with the characters on the corresponding control sequence, starting from the partial sequence sites, to detect a non-matching site.
  • a method for detecting polymorphism in target sequence data relative to control sequence data, comprising: (1) a process of detecting a substitution, copy number polymorphism, STRP, insertion, deletion, inversion, or translocation in the target sequence data by a) providing the frequency of occurrence of each partial sequence of a subset of the length-k partial sequences of the target sequence data; b) providing the frequency of occurrence of each partial sequence of a subset of the length-k partial sequences of the control sequence data; and c) comparing the target sequence with the control sequence and detecting polymorphism based on the comparison of the frequencies of occurrence; (2) a) specifying the positions on the control sequence of at least two partial sequences in the sequence of the target sequence data; b) comparing the positional relationship between the partial sequences in the target sequence data with the positional relationship between the partial sequences on the control sequence; c) When the positional relationship between the partial sequences in
  • detecting an insertion, deletion, or inversion in the target sequence data by the step of sequentially comparing the characters between the partial sequence sites with the characters on the corresponding control sequence, starting from the partial sequence sites, to detect mismatched sites.
  • the method of claim 16 having the features described in any one or more of the items.
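Process (1) of the item above, comparing the frequency of occurrence of length-k partial sequences, can be illustrated with a short sketch. The function names are hypothetical, and for simplicity the sketch counts every k-mer rather than a chosen subset.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every length-k partial sequence across a collection of sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def differing_kmers(target_reads, control_reads, k):
    """Return {kmer: (target_count, control_count)} where the counts differ."""
    t = kmer_counts(target_reads, k)
    c = kmer_counts(control_reads, k)
    # Counter returns 0 for k-mers that are absent from one side.
    return {kmer: (t[kmer], c[kmer]) for kmer in t.keys() | c.keys() if t[kmer] != c[kmer]}


# A single substitution (T -> A at index 3) changes every k-mer that spans it.
diff = differing_kmers(["ACGAAC"], ["ACGTAC"], k=3)
```

K-mers present only in the target or only in the control flag a candidate polymorphic site; locating its junctions is then left to process (2).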
  • (Item 18) A method for detecting polymorphism in reference sequence data in target sequence data, comprising the step of creating, from the reference sequence data, a k-length partial array set of reference sequences associated with each position information.
  • the method further includes, when the positional relationship is not different, the step of detecting a non-matching site by comparing characters at the partial sequence sites in the target sequence data with characters on the corresponding control sequence, and the step of determining that a substitution is present if a non-matching site exists; the processes being performed simultaneously, in parallel, or sequentially.
  • (Item 18A) The method of item 18, having the features described in any one or more of the items.
  • a method for detecting polymorphism in target sequence data relative to control sequence data, comprising: a) identifying the positions on the control sequence of at least two partial sequences in the sequence of the target sequence data; b) comparing the positional relationship between the partial sequences in the target sequence data with the positional relationship between the partial sequences on the control sequence; c) if the two positional relationships differ, aligning the target sequence data with the control sequence by aligning the control sequence with the target sequence data so that the position of the first partial sequence matches, and aligning the control sequence with the target sequence data so that the position of the second partial sequence matches.
  • (Item A1A) A method according to any of the preceding items having the features described in any one or more of the items.
  • (Item A2) The method according to any of the items, wherein the control sequence data is reference sequence data.
  • (Item A3) The method according to any of the items, wherein the target sequence data and the control sequence data are base sequence data.
  • (Item A4) The method according to any of the items, wherein the target sequence data is sequence data obtained by next-generation sequencing.
  • (Item A5) The method according to any of the preceding items, wherein the polymorphism is an insertion, a deletion, an inversion, a translocation, or a substitution.
  • (Item B1) A method of determining the position of a target sequence on a control sequence, comprising: a) outputting, for a plurality of k-length partial sequences in the control sequence, each partial sequence and its position in the control sequence; b) outputting, for a plurality of k-length partial sequences in the target sequence, each partial sequence and its position in the target sequence; c) comparing the partial sequences obtained in a) and b), and associating, for identical partial sequences, the position in the control sequence with the position in the target sequence, where k is a length not exceeding the length of the target sequence.
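Steps a) to c) of Item B1 amount to indexing k-length partial sequences by position and joining the two lists on identical sequences. The sketch below uses hypothetical helper names and an in-memory dictionary; the patent text does not specify the data structure.

```python
from collections import defaultdict


def index_kmers(seq, k):
    """Map each length-k partial sequence to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def locate(target, control, k):
    """Return (target_pos, control_pos) pairs for identical k-length partial sequences."""
    if k > len(target):
        raise ValueError("k must not exceed the length of the target sequence")
    control_index = index_kmers(control, k)
    hits = []
    for t_pos in range(len(target) - k + 1):
        # .get avoids inserting empty entries into the defaultdict.
        for c_pos in control_index.get(target[t_pos:t_pos + k], ()):
            hits.append((t_pos, c_pos))
    return hits


hits = locate("ACAGGCTT", "GATTACAGGCTTAACCGT", k=4)
```

Each hit ties a position in the target to a position in the control, which is the input needed for comparing positional relationships between partial sequences.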
  • (Item B1A) A method according to the preceding item, having the features described in any one or more of the items.
  • (Item B2) The method according to any of the items, wherein the control sequence data is reference sequence data.
  • (Item B3) The method according to any of the items, wherein the target sequence data and the control sequence data are base sequence data.
  • (Item B4) The method according to any of the items, wherein the target sequence data is sequence data obtained by next-generation sequencing.
  • (Item B5) The method according to any of the preceding items, further comprising a step of aligning the target sequence data and the control sequence, wherein the control sequence is aligned with the target sequence data so that the position of the first partial sequence of the target sequence matches, and the control sequence is aligned with the target sequence data so that the position of the second partial sequence of the target sequence matches.
  • (Item B6) The method according to any of the preceding items, wherein the aligning step comprises displaying the result of the alignment.
  • (Item B7) The control sequence is displayed above the target sequence data so that the position of the first partial sequence of the target sequence matches.
  • (Item B6) The method according to any of the items above, further comprising the step of detecting polymorphism in the target sequence data relative to the control sequence data based on the alignment.
  • (Item C1) A method of confirming a mutation of a target sequence suspected of having a mutation with respect to a control sequence, a) providing a set of L-length partial sequence data of the target sequence and a set of L'-length partial sequence data of the control sequence; b) a plurality of partial sequences including a portion suspected of having a mutation in the reference sequence, positional information of the partial sequences, information on substitution, insertion, deletion, inversion and / or translocation, and Providing a set of indication of whether it corresponds to the L length of the sequence or the L 'length sequence of the control sequence, and an indication that the sequence does not contain a mutation, wherein L and L' are different Providing a set including a plurality of L-length subsequences and a set including a plurality of L′-length subsequences; c) a plurality of partial sequences including a portion in which a portion suspected of having a mutation in the reference sequence is converted to a mutated character,
  • (Item C1A) The method according to the above item, having the features described in any one or more of the items.
  • (Item C2) The method according to any of the items, wherein the target sequence and the control sequence are base sequences.
  • (Item C3) The method according to any of the items, wherein the target sequence data is sequence data obtained by next-generation sequencing.
  • (Item C4) The method according to any of the items, wherein the control sequence is sequence data obtained by next-generation sequencing.
  • (Item C5) The method according to any of the items, wherein the control sequence is a reference sequence, and the set of L′-length partial sequence data is a set of L′-length partial sequences of the reference sequence.
  • (Item C6) The method according to any of the preceding items, wherein the mutation is an insertion, a deletion, an inversion, a translocation, or a substitution.
  • (Item D1) A comparison method between a control sequence and a target sequence, The control sequence comprises at least two partial sequences identical to at least two partial sequences in the subject sequence, Aligning the control sequence with the subject sequence so that the positions of the first partial sequence match; Aligning the control sequence with the subject sequence such that the positions of the second subsequences coincide.
  • (Item D1A) The method according to the above item, having the features described in any one or more of the items.
  • (Item D2) The method of any of the above items, wherein the aligning comprises one or more alignments that are in the opposite direction to the other alignment.
  • (Item D3) The method according to any of the items, wherein the comparison expresses polymorphism to the control sequence data in the target sequence data.
  • (Item D4) The method according to any of the above items, wherein the control sequence is displayed above the target sequence data so that the position of the first partial sequence of the target sequence matches, and the control sequence is displayed below the target sequence data so that the position of the second partial sequence of the target sequence matches.
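The display described in Item D4 can be sketched as a three-row text view: the control sequence aligned on the first partial sequence, the target sequence, and the control sequence aligned on the second partial sequence. The function names and the use of spaces for out-of-range positions are assumptions made for illustration.

```python
def aligned_row(target, control, t_pos, c_pos):
    """Slide the control so that control[c_pos] sits at column t_pos of the target."""
    shift = c_pos - t_pos
    return "".join(
        control[i + shift] if 0 <= i + shift < len(control) else " "
        for i in range(len(target))
    )


def two_way_view(target, control, first, second):
    """first and second are (target_pos, control_pos) anchors of the two partial sequences."""
    top = aligned_row(target, control, *first)
    bottom = aligned_row(target, control, *second)
    return "\n".join([top, target, bottom])


control = "AAACCCGGGTTT"
target = "AAAGGGTTT"          # the control with "CCC" deleted
view = two_way_view(target, control, first=(0, 0), second=(6, 9))
print(view)
```

In this toy case the top row agrees with the target up to the deletion junction and the bottom row agrees from the junction onward, so the junction can be read off by eye.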
  • (Item D5) The method according to any one of Items, wherein the comparison expresses the boundary between a polymorphic site and a non-polymorphic site in the target sequence data relative to the control sequence data.
  • (Item E1) A comparison method between a control sequence and a target sequence, wherein the control sequence comprises at least two partial sequences identical to at least two partial sequences in the target sequence, the method comprising assigning to a polymorphism in the target sequence, as an identifier, the position on the control sequence that is not identical when the control sequence is aligned with the target sequence so that the position of the first partial sequence matches, and the position on the control sequence that is not identical when the control sequence is aligned with the target sequence so that the position of the second partial sequence matches.
  • (Item E1A) A method according to the preceding item, having the features described in any one or more of the items.
  • (Item F1) A comparison method between a control sequence and a target sequence,
  • the control sequence includes N partial sequences identical to N partial sequences in the target sequence, where N is an integer of 2 or more, and the method comprises, for each n from 1 to N, aligning the control sequence with the target sequence such that the position of the n-th partial sequence matches.
  • (Item F1A) The method according to the above item, having the features described in any one or more of the items.
  • the aligning comprises one or more alignments that are in the opposite direction to the other alignment.
  • polymorphisms, in particular deletions, insertions, inversions, and/or translocations, can be detected accurately between two or more sequences.
  • FIG. 1 is a flow diagram illustrating one specific example of an embodiment of the method of the present invention.
  • FIG. 2 shows the results of polymorphisms detected in the data in which mutations were introduced to the rice reference sequence (IRGSP1.0).
  • Chr is the chromosome number.
  • Top is the insertion/deletion junction on the top strand (the 5' to 3' sequence of the base sequence).
  • Bottom is the insertion/deletion junction on the bottom strand (complementary strand).
  • Size is the size of the insertion/deletion (deletions are shown as negative values).
  • Reads is the number of next-generation sequencer reads (simulated sequences) detected at similar positions and sizes.
  • FIG. 3 is a view showing the result of detection of polymorphism in Example 3 of the present specification.
  • FIG. 4A schematically illustrates one embodiment of the method of the present invention.
  • FIG. 4B schematically shows an embodiment of the method of the present invention.
  • FIG. 4C schematically illustrates one embodiment of the method of the present invention.
  • FIG. 5A schematically shows an embodiment of the system of the present invention.
  • FIG. 5B schematically illustrates a further embodiment of the system of the present invention.
  • FIG. 6 is a flow chart showing one embodiment that combines the polymorphism detection flow using the frequency of k-mer sequences with the polymorphism detection flow using the positional relationship of partial sequences.
  • FIG. 7 summarizes the detection results of the method of the present invention and the method using Samtools in the data in which mutations were introduced to the rice reference sequence (IRGSP1.0).
  • FIG. 8 summarizes the detection results by the method using Samtools in the data in which mutations were introduced to the rice reference sequence (IRGSP1.0).
  • sequence (array) refers to a plurality of variables, each of which takes some value, together with information on the order of those variables. Typically, it is displayed as a character string.
  • target sequence refers to any sequence for which polymorphism is to be detected, and may also be referred to as “subject sequence”, “subject”, or “target” in the present specification.
  • control sequence refers to any sequence used as a reference, differences from which are detected as polymorphisms; as used herein, it may also be written as “control”, “reference sequence”, or “comparison sequence”.
  • polymorphism refers to any portion of the subject sequence that differs from the control sequence.
  • mutation can also be used in the same meaning.
  • reference sequence refers to a sequence that can be treated as a full length sequence of the subject sequence and / or control sequence. Which sequence is to be used as a full-length sequence is appropriately determined according to the sequence used as the target sequence and / or control sequence, and is not limited to the exemplified ones, but is present in, for example, a database on the web A whole genome sequence, a chromosome full length sequence, a gene full length sequence, a plasmid full length sequence, an exon full length sequence, a protein full length sequence or the like can be used as a reference sequence.
  • sequence data refers to data that gives information about a certain sequence. Typically, the sequence itself can also be referred to as sequence data, and data giving information on a part of the sequence (eg, analysis data obtained by sequencing of genomic sequences) is also included as sequence data.
  • partial sequence of a sequence refers to any sequence contained in the sequence.
  • subset refers to any subset of a set of sequences and a set of subsequences of those sequences.
  • next-generation sequencing is a sequencing technique that parallelizes the sequencing process to generate tens of millions to hundreds of millions of sequence reads in a single run.
  • Next-generation sequencer refers to an apparatus for performing next-generation sequencing.
  • coverage refers to how many multiples of the total sequence length the amount of sequence data corresponds to. It may be written as “coverage”, “N-fold reads”, etc.
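As a worked example of this definition, coverage can be computed as the total number of sequenced characters divided by the total sequence length. The function name and the numbers below are illustrative only.

```python
def coverage(read_lengths, reference_length):
    """Total amount of sequence data divided by the total sequence length."""
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    return sum(read_lengths) / reference_length


# Thirty reads of 100 characters over a 1,000-character sequence give 3-fold coverage.
fold = coverage([100] * 30, 1000)
```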
  • sequence structure refers to each of a series of physically separated sequences within a sequence.
  • For example, each of the chromosomes can be referred to as a sequence structure.
  • translocation refers to a polymorphism in which, in a sequence having a plurality of sequence structures, a partial sequence on one sequence structure has moved onto another sequence structure.
  • junction refers to the boundaries between identical and non-identical parts of two sequences which are partially identical.
  • identifier refers to a name given to distinguish one polymorphism from another polymorphism. Generally, although it is often described by the start position and type of polymorphism, the identifier described herein can be used.
  • edge refers to the end of the portion of the sequence that contains the polymorphism.
  • a method for detecting polymorphism in target sequence data relative to control sequence data may include the step of specifying the positions on the control sequence of at least two partial sequences in the sequence of the target sequence data.
  • This method can include the step of comparing the positional relationship between partial sequences in the target sequence data with the positional relationship between partial sequences on the control sequence.
  • If the positional relationship between the partial sequences in the target sequence data does not differ from the positional relationship between the partial sequences on the control sequence, it can be determined that no polymorphism of interest is present. If the positional relationships differ, it can be determined that a polymorphism of interest is present.
  • The method may include a step of sequentially comparing the characters between the partial sequence sites in the target sequence data with the characters on the corresponding control sequence, starting from the partial sequence sites, to detect a mismatched site.
  • the method of the present invention exhibits enhanced detection power over the prior art.
  • An example of an embodiment of the method of the present invention is illustrated in FIGS. 4A-C.
  • The method may include a step of detecting a non-matching site by comparing the characters at the partial sequence sites in the target sequence data with the characters on the corresponding control sequence. It can be determined that a polymorphism is present when a non-matching site is detected, and that no polymorphism is present when none is detected (FIG. 4C). In this case, the comparison need not start from the partial sequence; the entire sequence of the target data (for example, a short read sequence) can be compared.
  • Sequences near both ends of a target sequence are positioned on a control sequence (e.g., a genomic sequence) and aligned inward from both directions (the bidirectional alignment method). This can be advantageous because sequences near both ends can be aligned relatively easily and deletions and additions can be identified exhaustively. Because the range in which a polymorphism can be detected narrows as the alignment starts further inward, the polymorphism detection efficiency is considered to be higher the closer the starting points are to both ends.
  • the reading accuracy of the sequencer may fall near the 3' end, so that the position may not be determined.
  • several bases (e.g., 0, 5, 10, or 15 bases)
  • the "two-way alignment method" can also be viewed as a method of detecting edges of polymorphisms. If the distance between the aligned positions of both ends and the distance between the positions mapped on the reference sequence are different, it means that there is an insertion / deletion in the target sequence. Translocations can be detected if both ends match different chromosomes, and inversions can be detected if the orientation is reversed on the same chromosome.
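The decision rule described here can be sketched as a small classifier over the two mapped ends. The `Anchor` type, the strand encoding, and the label strings are assumptions for illustration; the signed size follows the FIG. 2 convention that deletions are negative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anchor:
    """Where one end of the target sequence maps on the control (hypothetical type)."""
    target_pos: int    # position within the target sequence (e.g., a read)
    structure: str     # sequence structure on the control, e.g. a chromosome name
    control_pos: int   # position on that sequence structure
    strand: str        # "+" or "-"


def classify(upstream, downstream):
    """Return (label, size); size is None when no distance comparison applies."""
    if upstream.structure != downstream.structure:
        return ("translocation", None)
    if upstream.strand != downstream.strand:
        return ("inversion", None)
    target_dist = downstream.target_pos - upstream.target_pos
    control_dist = abs(downstream.control_pos - upstream.control_pos)
    size = target_dist - control_dist  # negative: sequence missing from the target
    if size < 0:
        return ("deletion", size)
    if size > 0:
        return ("insertion", size)
    return ("no_positional_difference", 0)


deletion = classify(Anchor(0, "chr1", 100, "+"), Anchor(80, "chr1", 1180, "+"))
insertion = classify(Anchor(0, "chr1", 100, "+"), Anchor(80, "chr1", 150, "+"))
```

When the label is `no_positional_difference`, a character-by-character comparison between the anchors would still be needed to find substitutions.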
  • The two-way alignment method is very fast, and the analysis can be completed on a single computer in a realistic time.
  • On the same computer, the two-way alignment method runs much faster than bwa and takes much less time than an analysis with Samtools and GATK.
  • the bi-directional alignment method can operate even in a relatively small memory environment (eg, about 4 Gbytes).
  • control sequence data is reference sequence data.
  • subject sequence data and / or control sequence data is base sequence data.
  • the subject sequence data may be sequence data obtained by next-generation sequencing.
  • the polymorphisms that can be detected by the present invention include, but are not limited to, substitutions, insertions, deletions, inversions, or translocations.
  • One or more of the following determinations can be made. When the partial sequences are present on different sequence structures of the control sequence, it is determined that a translocation exists. When the partial sequences are present on the same sequence structure of the control sequence but the orientation differs from that in the target sequence data, it is determined that an inversion exists.
  • When the partial sequences are present on the same sequence structure of the control sequence, the orientation is identical to that in the target sequence data, and the distance between the partial sequences in the target sequence data is shorter than the distance on the control sequence, it is determined that a deletion exists.
  • When the partial sequences are present on the same sequence structure of the control sequence, the orientation is identical to that in the target sequence data, and the distance between the partial sequences in the target sequence data is longer than the distance on the control sequence, it is determined that an insertion exists. In addition to, or instead of, these determinations, when the positional relationship does not differ and a position is detected where a character between the partial sequence sites in the target sequence data does not match the character on the corresponding control sequence, it can be determined that a substitution exists.
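  • The positional rules above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the `Hit` record, the `classify` function, the "+"/"-" strand labels, and the convention that both partial sequences are looked up in read orientation are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical record: where one partial sequence of a read lands on the control."""
    structure: str  # sequence structure identifier, e.g. a chromosome name
    position: int   # position of the first character on the control sequence
    strand: str     # "+" or "-" orientation on the control sequence


def classify(read_distance: int, up: Hit, down: Hit) -> str:
    """Classify from two hits; read_distance is the distance between the
    two partial sequences measured on the target (read) itself."""
    if up.structure != down.structure:
        return "translocation"          # hits on different sequence structures
    if up.strand != down.strand:
        return "inversion"              # same structure, orientation differs
    control_distance = abs(down.position - up.position)
    if read_distance < control_distance:
        return "deletion"               # target is shorter than the control
    if read_distance > control_distance:
        return "insertion"              # target is longer than the control
    return "no positional difference"   # substitutions need a character scan
```

  • A call such as `classify(50, Hit("chr1", 100, "+"), Hit("chr1", 180, "+"))` reports a deletion, because the two partial sequences are 50 characters apart on the read but 80 apart on the control.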
  • the method of the present invention can sensitively detect changes in simple sequence repeat (SSR).
  • Changes in the number of repetitions of a run of one type of character (for example, poly A, poly C, poly G, etc.), of 2 types of characters (e.g., CA repeats), of 3 types of characters, of 4 types of characters (for example, repetition of AGAT), of five types of characters (for example, repetition of AATGG), and so on have been difficult to detect using conventional detection methods, but the method of the present invention can detect them with high sensitivity. Very large deletions, translocations and inversions can also be detected.
  • The method of the present invention is characterized by sequentially comparing the characters between the partial sequence sites in the target sequence data with the characters on the corresponding control sequence, starting from a partial sequence site, to detect non-matching sites. This detection may include, for example, searching upstream from the downstream partial sequence in the target sequence data for matching characters until a character is found that does not match the character at the corresponding position in the control sequence, and/or searching downstream from the upstream partial sequence in the target sequence data for matching characters until such a non-matching character is found.
  • The search for matching characters may proceed a fixed number of characters at a time, for example 1 to 3 characters, and preferably 1 character.
  • When a non-matching character is detected, the search may end there.
  • The method of the present invention may further include the step of checking, when a non-matching character is detected, whether the characters beyond it match.
  • Depending on the result of that check, the detected non-matching character is identified as the boundary of a polymorphism; otherwise, the non-matching character can be ignored and the search for matching characters can continue.
  • Alternatively, under a given condition the detected non-matching character can be identified as the boundary of the polymorphism and, otherwise, ignored so that the search for matching characters continues.
  • The detected non-matching character can be identified as a boundary portion of a polymorphism.
  • A method for detecting a polymorphism in target sequence data relative to control sequence data may include, when the positional relationship between partial sequences in the target sequence data differs from the positional relationship between those partial sequences on the control sequence, a step of aligning the target sequence data and the control sequence. In this step, the control sequence is aligned so that the position of the first partial sequence of the target sequence matches, and the control sequence is also aligned so that the position of the second partial sequence of the target sequence matches. The aligning step may include displaying the result of the alignment. In the display, the control sequence is shown above the target sequence data so that the position of the first partial sequence of the target sequence matches, and below the target sequence data so that the position of the second partial sequence of the target sequence matches.
  • Such a display can be stored as an image or as text data, and can be used as a method for expressing polymorphisms in polymorphism databases and the like. Such a display is useful in the transmission of information about polymorphisms.
  • the above method may comprise the features described elsewhere herein.
  • The method may include the steps of a) locating at least two partial sequences in the sequence of the target sequence data on the control sequence, and/or b) comparing the positional relationship between the partial sequences in the target sequence data with the positional relationship between the partial sequences on the control sequence.
  • When the control sequence includes at least two partial sequences identical to at least two partial sequences in the target sequence, the sequences can be compared by aligning the control sequence with the target sequence so that the positions of the first partial sequence match, and by aligning the control sequence with the target sequence so that the positions of the second partial sequence match.
  • The comparison may include simultaneously performing multiple alignments of a first sequence (for example, a target sequence) and a second sequence (for example, a control sequence), each aligned at one of two or more matching positions. The number of positions is not particularly limited and may be, for example, two, three, four, five, six, seven, eight or more.
  • the comparison can express the boundary between the polymorphic site and the non-polymorphic site in the target sequence data relative to the control sequence data.
  • Aligning with the control sequence at three or more positions is effective when alignment with only one pair is not possible because a position cannot be specified, particularly when a partial sequence is a repeat sequence. When alignment with one pair fails, a unique location may be identified by slightly shifting the position of the partial sequence. For example, when aligning only at both ends, an end portion may fall in a repeat region so that it cannot be positioned; a fixed margin (for example, 0, 5, 10, or 15 characters) can therefore be taken from the end, and the k-mer on the inner side can be used to specify the position on the reference genome and proceed with the alignment.
  • The sequence in the control sequence adjacent to the matching portion is obtained and aligned with the target sequence; the orientation of the acquired sequence as seen from the matching portion is referred to herein as the "direction of alignment".
  • If the "direction of alignment" is the same, the alignments are "forward", and if the "direction of alignment" is different, they are "reverse".
  • The "direction of alignment" is relative, but when the sequence itself has a direction (for example, a nucleic acid sequence has a 5' → 3' direction and an amino acid sequence has an N → C direction), the "direction of alignment" may be stated with respect to the direction of the sequence itself.
  • one or more alignments include alignment in the opposite direction to the other alignments. This is because it is possible to obtain information on the junction of polymorphic portions existing between the control sequence and two or more matching portions of the target sequence by aligning both in the forward direction and the reverse direction. In addition, since acquiring the arrangement
  • In the display, the control sequence is shown above the target sequence data so that the position of the first partial sequence of the target sequence matches, and below the target sequence data so that the position of the second partial sequence of the target sequence matches (or vice versa).
  • boundary locations (junctions)
  • A method of comparing a control sequence with a target sequence, wherein the control sequence comprises at least two partial sequences identical to at least two partial sequences in the target sequence, the method comprising assigning to a polymorphism in the target sequence, as identifiers, the position on the control sequence that becomes unmatched when the control sequence is aligned with the target sequence so that the positions of the first partial sequence match, and the position on the control sequence that becomes unmatched when the control sequence is aligned with the target sequence so that the positions of the second partial sequence match.
  • The present invention provides a program for causing a computer to execute the method of detecting polymorphism of the present invention, a recording medium on which the program is recorded, and a system for implementing the method.
  • Any of the features described for the method of detecting polymorphism, or a combination thereof, can be employed as optional features here.
  • A method is provided for detecting a polymorphism in target sequence data relative to control sequence data.
  • the polymorphisms detected include, but are not limited to, insertions, deletions, inversions, or translocations.
  • the method may include the step of identifying at least two partial sequences in the sequence of the subject sequence data on the control sequence.
  • the length of the partial sequence in the sequence of the target sequence data can be a fixed length (k-mer).
  • k is not limited, but can be any value up to the length of each sequence of sequence data (for example, each short read of next-generation sequencer), for example, 500, 400, 300, 200, 100, 50, 40, 30, 25, 20, 15, etc. can be mentioned.
  • the position on the control sequence can be specified by searching the control sequence using a partial sequence in the target sequence data as a query.
  • the search may be linear search, binary search, interpolation search, hash search method or the like.
  • the search may be performed by the methods described herein (eg, Example B1).
  • In Example B1, a method is demonstrated in which the position and orientation on the control sequence of a partial sequence in the target sequence data are output by the unix join command.
  • Herein, a positioning method that includes executing the unix join command or equivalent processing is referred to as the "Join method", "mapping by join (method)", "MBJ (method)", or the like.
  • The search data described in the present specification can be used in a binary search or in the Join method. Binary search is a preferred search, but the Join method is more preferred because it allows faster mapping, as described herein.
  • A set of k-length partial sequences created from a control sequence can suitably be used as the search data (e.g., for the Join method or binary search).
  • The set of k-length partial sequences prepared from the control sequence can be created so that each partial sequence is associated with an identifier of the sequence structure to which it belongs (for example, a chromosome number), the position of the partial sequence (for example, the position of its first character), its orientation, and the like.
  • The position on the control sequence of a partial sequence in the sequence of the target sequence data also includes its direction. Even when two k-length partial sequences in the search data have their first character at the same position, if their orientations differ, the difference in orientation can be detected as a difference in position.
  • Reference genome data for search can be created as follows: 1. Acquire k-mers while shifting one base at a time from the end of the nucleotide sequence data of each chromosome. 2. Output the k-mer, the chromosome number, the position of its first base on the genome, and the direction on one line. 3. Output the complementary strand of the k-mer, the chromosome number, the genome position of its first base, and the direction on one line. 4. Sort all of the output data for the normal strand and the complementary strand lexicographically by k-mer sequence.
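  • The index-building steps above can be sketched as follows. This is an illustrative assumption rather than the code used in the examples of this specification: the function names, the 1-based coordinates, the "f"/"r" direction labels, and the choice to report the complementary-strand k-mer at the forward-coordinate position of its first base are all choices made here.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str, int, str]  # (k-mer, chromosome, position, direction)

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Return the complementary strand read in its own 5'->3' direction."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq))


def build_kmer_index(chromosomes: Dict[str, str], k: int) -> List[Row]:
    """Emit one row per k-mer for both strands, then sort by k-mer."""
    rows: List[Row] = []
    for name, seq in chromosomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Normal strand: first base is at 1-based position i + 1.
            rows.append((kmer, name, i + 1, "f"))
            # Complementary strand: its first base pairs with position i + k.
            rows.append((reverse_complement(kmer), name, i + k, "r"))
    rows.sort()  # lexicographic by k-mer, ties broken by the other fields
    return rows
```

  • Sorting the rows once is what makes both the binary search and the join-style lookup possible, since both rely on the k-mer column being in lexicographic order.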
  • By searching the reference genome data for search created in this way, using as a query a partial sequence of the target sequence data (for example, a k-mer partial sequence on a short read from a sequencer), it is possible to identify which position on the reference genome each partial sequence on each short read corresponds to.
  • Partial sequences at a plurality of places can be used as the partial sequences in the sequence of the target sequence data.
  • the method of the present invention can include the identification of the positions on the control sequence of the two, three, four, five, six, seven or eight partial sequences. For positional comparison, specifying two positions is sufficient.
  • positions on control sequences of two partial sequences in the sequence of the target sequence data are specified.
  • the search for the position (or presence) in the control sequence of the partial sequence in the sequence of the target sequence data may be repeated multiple times for one sequence to specify the hit position on the control sequence.
  • The position of the partial sequence within the sequence of the target sequence data is not limited. Therefore, a partial sequence can be acquired from the sequence of the target sequence data and the control sequence data searched; if there is no hit (for example, no hit to a unique position), a partial sequence can be acquired from a different part of the same sequence and the search repeated.
  • The partial sequence may be obtained from a position within 5, 10 or 15 characters of one end or both ends of the sequence. Furthermore, these partial sequences can be used sequentially as queries.
  • When the position of a partial sequence on a short read sequence derived from a next-generation sequencer is specified using the reference genome data for search, this can be performed according to the flow as shown in FIG.
  • 1. K-mers are taken starting 5 bases inward from both ends of the short read sequence, and a binary search is performed on the reference genome data to determine whether both sides hit unique positions. 2. If the k-mers on one or both sides did not hit a unique position, a binary search is performed with k-mers starting 10 bases inward to determine whether both sides hit unique positions. 3. If the k-mers on one or both sides still did not hit a unique position, a binary search is performed with k-mers starting 15 bases inward to determine whether both sides hit unique positions. 4. If both sides hit unique positions, the corresponding reference sequence is acquired from the position information of the upstream and downstream k-mer hits.
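  • The retry flow above can be sketched as follows, assuming an index that is a list of `(k-mer, chromosome, position, direction)` tuples sorted by k-mer. The names `unique_hit` and `locate_read_ends` are hypothetical, and the choice to retry both ends at each margin (rather than only the end that failed) is one reading of the flow, not something the text fixes.

```python
from bisect import bisect_left
from typing import List, Optional, Sequence, Tuple

Row = Tuple[str, str, int, str]  # (k-mer, chromosome, position, direction)


def unique_hit(index: List[Row], kmer: str) -> Optional[Row]:
    """Binary-search the sorted index; return the row only if the k-mer occurs once."""
    i = bisect_left(index, (kmer,))  # a 1-tuple sorts before any row sharing the k-mer
    if i == len(index) or index[i][0] != kmer:
        return None                  # not present
    if i + 1 < len(index) and index[i + 1][0] == kmer:
        return None                  # present more than once: not a unique position
    return index[i]


def locate_read_ends(
    read: str, index: List[Row], k: int, margins: Sequence[int] = (5, 10, 15)
) -> Optional[Tuple[int, Row, Row]]:
    """Try k-mers 5, 10, then 15 bases inside both ends until both hit uniquely."""
    for m in margins:
        if len(read) < 2 * m + k:
            break                    # read too short for this margin
        up = unique_hit(index, read[m:m + k])
        down = unique_hit(index, read[len(read) - m - k:len(read) - m])
        if up is not None and down is not None:
            return m, up, down       # margin used plus both hits
    return None                      # the read could not be positioned
```

  • Returning the margin together with the two hits lets a caller convert the hit positions back to read coordinates before acquiring the corresponding reference sequence.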
  • the method of the present invention comprises the step of comparing the positional relationship between partial sequences in the target sequence data with the positional relationship between partial sequences on the control sequence.
  • The positional relationship includes the distance between two or more partial sequences, the sequence structure to which two or more partial sequences belong, the orientation of each of two or more partial sequences, and the like. If the positional relationship between partial sequences in the target sequence data does not differ from the positional relationship between the partial sequences on the control sequence, it can be determined that there is no polymorphism of interest. In addition, the partial sequences within a given sequence in the target sequence data can be regarded as being present on the same sequence structure.
  • The method may include a step of detecting a non-matching portion by comparing the characters in the target sequence data with the characters at the corresponding positions on the control sequence. A polymorphism is judged to be present when a non-matching portion is detected and absent when none is detected. In this case, the comparison need not start from a partial sequence; the entire sequence of the target data (for example, a short read sequence) can be compared.
  • If the positional relationship between partial sequences in the target sequence data differs from the positional relationship between the partial sequences on the control sequence, it can be determined that there is a polymorphism of interest.
  • the polymorphisms detected include, but are not limited to, insertions, deletions, inversions, or translocations.
  • When it is determined that there is a polymorphism of interest, one or more of the following determinations can be made. When the partial sequences are present on different sequence structures of the control sequence, it is determined that a translocation exists. When the partial sequences are present on the same sequence structure of the control sequence but the orientation differs from that in the target sequence data, it is determined that an inversion exists.
  • When the partial sequences are present on the same sequence structure of the control sequence, the orientation is identical to that in the target sequence data, and the distance between the partial sequences in the target sequence data is shorter than the distance on the control sequence, it is determined that a deletion exists.
  • When the partial sequences are present on the same sequence structure of the control sequence, the orientation is identical to that in the target sequence data, and the distance between the partial sequences in the target sequence data is longer than the distance on the control sequence, it is determined that an insertion exists. In addition to, or instead of, these determinations, when the positional relationship does not differ and a position is detected where a character between the partial sequence sites in the target sequence data does not match the character on the corresponding control sequence, it can be determined that a substitution exists.
  • There is provided a method of determining the position of a target sequence on a control sequence, comprising: a) outputting, for a plurality of k-length partial sequences in the control sequence, each sequence and its position in the control sequence; b) outputting, for a plurality of k-length partial sequences in the target sequence, each sequence and its position in the target sequence; and c) comparing the sequences obtained in a) and b) and, for matching sequences, correlating the position in the control sequence with the position in the target sequence, where k is a length not exceeding the length of the target sequence.
  • Such methods may be utilized in the polymorphism detection methods described herein.
  • the mapping method can be performed at high speed as demonstrated in Example B1 of the present specification, and is useful when the data of the control sequence is large (for example, reference genomic sequence).
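  • Steps a) to c) of the mapping method can be illustrated with a sort-merge join, which is the kind of processing the unix join command performs on sorted input. The sketch below is an in-memory stand-in under stated assumptions: the function names are hypothetical, positions are 0-based, and only one strand is considered.

```python
from typing import List, Tuple


def kmers_with_positions(seq: str, k: int) -> List[Tuple[str, int]]:
    """Steps a) and b): list every k-length partial sequence with its position, sorted."""
    return sorted((seq[i:i + k], i) for i in range(len(seq) - k + 1))


def merge_join(
    control: List[Tuple[str, int]], target: List[Tuple[str, int]]
) -> List[Tuple[str, int, int]]:
    """Step c): walk both sorted tables once and pair positions of equal k-mers."""
    out: List[Tuple[str, int, int]] = []
    i = j = 0
    while i < len(control) and j < len(target):
        ck, tk = control[i][0], target[j][0]
        if ck < tk:
            i += 1
        elif ck > tk:
            j += 1
        else:
            # Collect the full run of this k-mer on both sides.
            i_end = i
            while i_end < len(control) and control[i_end][0] == ck:
                i_end += 1
            j_end = j
            while j_end < len(target) and target[j_end][0] == ck:
                j_end += 1
            for _, cpos in control[i:i_end]:
                for _, tpos in target[j:j_end]:
                    out.append((ck, cpos, tpos))
            i, j = i_end, j_end
    return out
```

  • Because each table is scanned only once after sorting, the cost of the join step grows linearly with the table sizes, which is consistent with the speed advantage the text attributes to the Join method for large control sequences.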
  • For k, any value up to the length of the target sequence can be used, for example, about 500, about 400, about 300, about 200, about 50, about 40, about 30, about 25, about 20, about 15, and the like.
  • As k increases, the amount of k-mer sequence data increases exponentially (for example, in the case of a base sequence, the number of base combinations quadruples each time k increases by one base).
  • k is preferably about 5 to 30, and more preferably k is about 15 to 22.
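  • As a worked illustration of this growth for a four-letter base alphabet, writing N(k) for the number of distinct k-mers that are possible (the symbol N is introduced here and does not appear in the specification):

```latex
N(k) = 4^{k}, \qquad N(k+1) = 4\,N(k), \qquad
N(15) = 4^{15} \approx 1.07 \times 10^{9}, \qquad
N(22) = 4^{22} \approx 1.76 \times 10^{13}
```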
  • the position of the subject sequence on the control sequence may be determined by the method described above, and the subject sequence data and the control sequence may be aligned.
  • The control sequence can be aligned with the target sequence data so that the position of the first partial sequence of the target sequence matches, and the control sequence can be aligned with the target sequence data so that the position of the second partial sequence of the target sequence matches.
  • The result of the alignment may be displayed. For example, the control sequence may be displayed above the target sequence data so that the position of the first partial sequence of the target sequence matches, and the control sequence may be displayed below the target sequence data so that the position of the second partial sequence of the target sequence matches. Based on the alignment, polymorphisms in the target sequence data relative to the control sequence data can be detected.
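  • A text rendering of such a display might look like the following sketch. The function name `two_way_display`, the 0-based start arguments, and the use of "|" to mark matching columns are assumptions made for illustration; the specification does not prescribe a particular layout beyond control above and control below.

```python
def two_way_display(
    target: str, control: str,
    first_t: int, first_c: int,
    second_t: int, second_c: int,
) -> str:
    """Show the control above the target (anchored at the first partial sequence)
    and below it (anchored at the second). Arguments are 0-based start positions
    of each partial sequence in the target (_t) and in the control (_c)."""

    def window(offset: int) -> str:
        # Control character placed in target column i is control[i + offset].
        return "".join(
            control[i + offset] if 0 <= i + offset < len(control) else " "
            for i in range(len(target))
        )

    def marks(row: str) -> str:
        return "".join("|" if a == b else " " for a, b in zip(row, target))

    upper = window(first_c - first_t)
    lower = window(second_c - second_t)
    return "\n".join([upper, marks(upper), target, marks(lower), lower])
```

  • For a target that lacks three characters of the control, the upper row stops matching where the deleted segment begins and the lower row starts matching where it ends, so the two runs of match marks meet at the junction.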
  • The polymorphism detection / sequence comparison method of the present invention can expand the analysis range by expanding the range in which partial sequences (for example, both ends) of the target sequence (for example, a read) can be mapped. Since alignment cannot be performed when one partial sequence lies in a repeat region, it is preferable that the positions of both ends of the read be specified. The 3' end region is more likely to contain errors and, while usable, is mapped less efficiently. Detection efficiency can therefore be improved by mapping at a plurality of shifted positions, such as positions 5 bases, 10 bases, and 15 bases inside the read.
  • Reads are also mapped to a certain extent in repeat regions, but in the above mapping method a k-mer at a unique position can be selected when the k-mer position data are created, so that at least the non-unique parts within the k-mer range are excluded from mapping. This can reduce the possibility of detecting false positives as compared to bwa.
  • One feature of the method of the present invention is that, when it is determined that there is a polymorphism of interest, it includes comparing the characters between the partial sequence sites in the target sequence data with the characters on the corresponding control sequence.
  • The step of comparing may be a step of sequentially comparing the characters between the partial sequence sites in the target sequence data with the characters on the corresponding control sequence, starting from a partial sequence site, to detect a non-matching site.
  • The method of the present invention is characterized by sequentially comparing the characters between the partial sequence sites in the target sequence data with the characters on the corresponding control sequence, starting from a partial sequence site, to detect non-matching sites. This step may include, for example, searching upstream from the downstream partial sequence in the target sequence data for matching characters until a character is found that does not match the character at the corresponding position in the control sequence, and/or searching downstream from the upstream partial sequence in the target sequence data for matching characters until such a non-matching character is found.
  • searching for matching characters from both upstream and downstream is included.
  • The search for matching characters may proceed a fixed number of characters at a time, for example 1 to 3 characters, and preferably 1 character.
  • When a non-matching character is detected, the search may end there.
  • The method of the present invention may further include the step of checking, when a non-matching character is detected, whether the characters beyond it match.
  • Depending on the result of that check, the detected non-matching character is identified as the boundary of a polymorphism; otherwise, the non-matching character can be ignored and the search for matching characters can continue.
  • Alternatively, under a given condition the detected non-matching character can be identified as the boundary of the polymorphism and, otherwise, ignored so that the search for matching characters continues.
  • The detected non-matching character can be identified as a boundary portion of a polymorphism.
  • The character adjacent to the partial sequence may itself be a mismatch, but detecting this as a polymorphism may cause false detection. Therefore, the boundary of a polymorphism may be detected only when a certain number of characters (for example, 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10 characters) beyond the portion matching the partial sequence of the target sequence match and a mismatching portion is then found. As a result, a case where the character adjacent to the portion at which the partial sequence of the target sequence matches the control sequence does not match is not detected.
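  • The character-by-character search described above can be sketched as follows. This is a simplified illustration under assumptions: the function name `find_boundary` and its parameters are hypothetical, the look-ahead rule (a mismatch counts as a boundary only if the next `lookahead` characters do not all match) is one possible reading of the check on the characters beyond a mismatch, and a mismatch too close to the end of either sequence to be checked is reported as a boundary.

```python
from typing import Optional


def find_boundary(
    target: str, control: str, t: int, c: int, step: int, lookahead: int = 1
) -> Optional[int]:
    """Walk from target[t] / control[c] in direction `step` (+1 downstream,
    -1 upstream). Return the target index of the first mismatch that is not
    followed by `lookahead` matching characters, or None if none is found."""
    while 0 <= t < len(target) and 0 <= c < len(control):
        if target[t] != control[c]:
            recovers = all(
                0 <= t + step * d < len(target)
                and 0 <= c + step * d < len(control)
                and target[t + step * d] == control[c + step * d]
                for d in range(1, lookahead + 1)
            )
            if not recovers:
                return t  # boundary of an insertion/deletion-like change
            # Isolated mismatch (e.g. a substitution): ignore it and continue.
        t += step
        c += step
    return None
```

  • Running the search downstream from the upstream partial sequence and upstream from the downstream partial sequence yields two boundary positions; when they are adjacent on the target, the junction of a deletion has been located from both sides.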
  • The sequence of the corresponding portion of the reference sequence that matched the upstream partial sequence may be placed above the sequence of the target sequence.
  • The sequence of the corresponding portion of the reference sequence that matched the downstream partial sequence can be placed below it so that the k-mer portions coincide, and these can be used as the comparison start points.
  • When checking whether the characters beyond a non-matching character match, a case where a certain proportion or more of the characters in a certain range do not match can be detected, and the non-matching character can then be used as the boundary.
  • The identifier of the sequence structure and the position of the non-matching character can be output as the boundary of the inserted / deleted sequence.
  • a method for identifying a mutation in a subject sequence suspected of having a mutation relative to a control sequence.
  • the polymorphism (mutation) may be confirmed using the method.
  • Methods to confirm mutations are available to confirm the presence of substitutions, insertions, deletions, inversions and / or translocations.
  • Partial sequences of the length L of the target sequence are cut out from the reference sequence so as to include the mutation site, and a set in which the mutation has been substituted and a set in which it has not are created, sorted, and output together with descriptions of the positional relationship, the presence or absence of the mutation, and so on.
  • These data and the sorted target sequences are processed with a unix command (or appropriate equivalent processing) to select the wild-type and mutant-type sequences contained in the target sequences, and the number of sequences per mutation site is examined. After sorting the selected sequences, the number of sequences can be counted with the uniq -c command. The same operation is performed on the control sequence (L' length). When there is a control individual for the target individual, the read sequences obtained from that individual can be used as the control sequence.
  • Alternatively, sequences created by cutting the reference sequence at L length can be used as the control sequence. If the target and control sequences differ in length, data sets of mutant and wild-type sequences corresponding to the respective lengths can be generated and the corresponding numbers determined.
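  • The counting step can be sketched in Python as follows. This is a simplified stand-in for the sort / uniq -c processing described above, limited to a single-character substitution; the names `windows_covering` and `count_support`, the 0-based `pos`, and the exact-match rule between reads and windows are assumptions introduced for illustration.

```python
from collections import Counter
from typing import Iterable, List


def windows_covering(reference: str, pos: int, length: int) -> List[str]:
    """All windows of the given length that include 0-based position pos."""
    first = max(0, pos - length + 1)
    last = min(pos, len(reference) - length)
    return [reference[s:s + length] for s in range(first, last + 1)]


def count_support(
    reads: Iterable[str], reference: str, pos: int, alt: str, length: int
) -> Counter:
    """Count reads identical to a wild-type or a mutant window of read length."""
    mutated = reference[:pos] + alt + reference[pos + 1:]
    wild = set(windows_covering(reference, pos, length))
    mutant = set(windows_covering(mutated, pos, length))
    counts: Counter = Counter()
    for read in reads:
        if read in mutant:
            counts["mutant"] += 1
        elif read in wild:
            counts["wild"] += 1
    return counts
```

  • Running the same count on reads from a control individual, with `length` set to that individual's read length, gives the per-site numbers for the control that the text describes comparing against.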
  • The method comprises the steps of: a) providing a set of L-length partial sequence data of the target sequence and a set of L'-length partial sequence data of the control sequence; b) providing a set comprising a plurality of partial sequences that include the portion of the reference sequence suspected of having a mutation, together with positional information on the partial sequences, information on the substitution, insertion, deletion, inversion and/or translocation, an indication of whether each corresponds to the L length of the target sequence or the L' length of the control sequence, and an indication that it is free of the mutation (where L and L' differ, the set comprises a plurality of L-length partial sequences and a plurality of L'-length partial sequences); and c) providing a plurality of partial sequences that include the portion obtained by converting the portion of the reference sequence suspected of having a mutation into the mutated characters.
  • L ' is an integer equal to or less than the total length of the control sequence.
  • When L ≠ L', for example when the control sequence is a short read from a control individual, partial sequence data sets of the target sequence length (L) and of the control short read length (L') can each be produced, and the numbers of matches can be measured separately.
  • The L'-length partial sequence data set of the control sequence can use sequencing data (L' length) from the "control individual", or L'-length partial sequences cut out from the reference sequence (in this case the length can be set freely, but it is preferable to use the same L length as the target sequence).
  • L or L' may be, for example but not limited to, the same as the length of the target sequence, for example, the short read length of a next-generation sequencer (e.g., about 500, about 400, about 300, about 200, about 50, about 40, about 30, about 25, about 20, or about 15).
  • L or L ' is preferably about 50 to about 200, and in one example about 100.
  • (Sequences) As the target sequence, control sequence and/or reference sequence of the present invention, any sequence in which a polymorphism can occur can be used. Note that a reference sequence can be used as a control sequence.
  • the subject sequence, control sequence and / or reference sequence is a biological sequence, eg, a base sequence (including sequences such as DNA, RNA and their analogs), an amino acid sequence Or a sugar chain sequence or the like. Examples of biological sequences include, for example, genomic sequences, chromosomal sequences, gene sequences, plasmid sequences, exon sequences, protein sequences and the like.
  • the target sequence data and the control sequence data are not limited, but in order to detect polymorphism, it is desirable that they be sequence data for sequences having a certain commonality.
  • The methods of obtaining the sequences may be identical or different. Data obtained by sequencing can be compared with each other, data obtained from a database or the like can be compared with each other, and data obtained by sequencing can be compared with data obtained from a database or the like.
  • the subject sequence data is sequence data obtained from an individual, and the control sequence data is sequence data obtained from another individual homologous to the individual or from a database.
  • the subject sequence data is sequence data obtained from a tissue sample of an individual and the control sequence data is sequence data obtained from another tissue or database of the individual.
  • the subject sequence data is sequence data obtained from a cell sample and the control sequence data is sequence data obtained from another cell or database.
  • the target sequence data and / or control sequence data used in the method of the present invention is nucleotide sequence data obtained by sequencing.
  • Sequencing methods include the Sanger method, the Maxam-Gilbert method, single-molecule real-time sequencing (e.g., Pacific Biosciences, Menlo Park, California), ion semiconductor sequencing (e.g., Ion Torrent, South San Francisco, California), pyrosequencing (e.g., 454, Branford, Connecticut), sequencing by ligation (e.g., SOLiD sequencing of Life Technologies, Carlsbad, California), sequencing by synthesis with reversible terminators (e.g., Illumina, San Diego, California), nucleic acid imaging techniques such as transmission electron microscopy, nanopore sequencing, and the like.
  • the subject sequence data and / or control sequence data used in the methods of the invention may be sequence data obtained by next-generation sequencing.
  • Next-generation sequencing includes sequencing by synthesis, pyrosequencing, sequencing by ligation, ion semiconductor sequencing, and nanopore sequencing.
  • the accuracy is limited by mapping to a reference or by assembly, so the method of the present invention is considered to offer substantial benefits when used.
  • the subject sequence data and/or control sequence data used in the method of the present invention is amino acid sequence data obtained by a dinitrophenylation method, a hydrazinolysis method, a carboxypeptidase method, an Edman method, a method using an apparatus that automates those methods (a peptide sequencer or protein sequencer), a method using mass spectrometry (for example, tandem mass spectrometry (MS/MS)) (for example, the sequence tag method), or the like.
  • the organism from which the target sequence data and/or control sequence data of the present invention are derived is not limited as long as it has biological sequences.
  • animals include vertebrates such as humans or non-human mammals (e.g., mice, rats, rabbits, sheep, pigs, cattle, horses, cats, dogs, monkeys, chimpanzees), birds, reptiles, amphibians and fish, and invertebrates such as insects, nematodes and the like.
  • plants include rice, wheat, corn, potato, barley, sweet potato, buckwheat, Arabidopsis thaliana, Lotus pea, tomato, cucumber, cabbage, Chinese cabbage, eggplant, sugar cane, sorghum, apple, orange, banana, peach, poplar, pine, cedar, Angiosperms, gymnosperms, ferns, mosses, algae and the like can be mentioned.
  • fungi, bacteria, viruses and the like may be used.
  • target sequence data and/or control sequence data derived from parts of these organisms, such as tissues and cells, may also be used to detect polymorphisms.
  • a program is provided for causing a computer to execute a method of detecting a polymorphism in target sequence data relative to control sequence data, the method including the steps of: a) storing target sequence data and control sequence data in a computer; b) identifying the positions on the control sequence of at least two partial sequences in the sequence of the target sequence data; c) comparing the positional relationship between the partial sequences in the target sequence data with the positional relationship between the partial sequences on the control sequence; and d) when the two positional relationships differ, determining that the polymorphism of interest is present, and comparing the characters between the partial sequence sites in the target sequence data with the characters on the corresponding control sequence sequentially, starting from the partial sequence sites, to detect mismatched sites.
  • the program may be written in any language.
  • a recording medium is provided that stores a program for causing a computer to execute a method of detecting a polymorphism in target sequence data relative to control sequence data, the method including the steps of: a) storing target sequence data and control sequence data in a computer; b) identifying the positions on the control sequence of at least two partial sequences in the sequence of the target sequence data; c) comparing the positional relationship between the partial sequences in the target sequence data with the positional relationship between the partial sequences on the control sequence; and d) when the two positional relationships do not differ, determining that the polymorphism of interest is absent, and, when the positional relationships differ, detecting the mismatched site.
  • the program may be written in any language.
  • the recording medium may be an external storage device such as a ROM, HDD, magnetic disk, flash memory such as USB memory, etc., which may be stored internally.
  • a system is provided for detecting a polymorphism in target sequence data relative to control sequence data, the system comprising: a sequence data providing unit configured to provide a computer with target sequence data and control sequence data; and a sequence data calculation unit configured to perform the steps of a) identifying the positions on the control sequence of at least two partial sequences in the sequence of the target sequence data; b) comparing the positional relationship between the partial sequences in the target sequence data with the positional relationship between the partial sequences on the control sequence; and c) when the two positional relationships do not differ, determining that the polymorphism of interest is absent, and, when the positional relationships differ, detecting the mismatched site.
  • Although FIG. 5A shows the case where the invention is realized by a single system, it is understood that realization by a plurality of systems is also included in the scope of the present invention.
  • in the system 1000, a RAM 1003, an external storage device 1005 (such as a ROM, an HDD, a magnetic disk, or a flash memory such as a USB memory), and an input/output interface (I/F) 1025 are connected via a system bus 1020 to a CPU 1001 incorporated in the computer system.
  • An input device 1009 such as a keyboard and a mouse, an output device 1007 such as a display, and a communication device 1011 such as a modem are connected to the input / output I / F 1025, respectively.
  • the external storage device 1005 includes an information database storage unit 1030 and a program storage unit 1040. Each is a fixed storage area secured in the external storage device 1005.
  • when various commands are entered via the input device 1009, or a command is received via the communication I/F, the communication device 1011 or the like, a software program installed in the storage device 1005 is called, loaded into the RAM 1003, and executed by the CPU 1001, so that, in cooperation with the OS (operating system), it performs the functions of the method of the present invention for detecting polymorphisms in target sequence data. Of course, the present invention can also be implemented by a mechanism other than such cooperation.
  • the target sequence data, the data of at least two partial sequences in the target sequence data, and/or the control sequence data may be input via the input device 1009, may be input via the communication I/F, the communication device 1011 or the like, or may be stored in the database storage unit 1030. The identified position data may be output through the output device 1007 or may be stored in the external storage device 1005, for example in the information database storage unit 1030.
  • the processing can be executed by a program stored in the program storage unit 1040, or by a software program installed in the external storage device 1005 in response to various commands entered via the input device 1009 or a command received via the communication I/F, the communication device 1011 or the like.
  • the comparison result may be output through the output device 1007 or may be stored in the external storage device 1005 such as the information database storage unit 1030.
  • the above calculation results may be stored in the database storage unit 1030 in association with known information on the sequence, such as biological, biochemical, or medical information (for example, information on a disease or disorder).
  • such an association may be made directly, as a network link, to data available over a network (the Internet, an intranet, etc.).
  • the computer program stored in the program storage unit 1040 configures the computer as a system that performs processing such as providing sequence data, providing partial sequence subsets, calculating position data, comparing position data, detecting polymorphism, and confirming polymorphism.
  • Each of these functions is an independent computer program, its module, routine, or the like, and is executed by the CPU 1001 to configure the computer as each system or apparatus.
  • the functions cooperate to constitute each system, and the programs for this processing may each be provided via the external storage device, the communication device, or the input device.
  • provision of target sequence data and/or control sequence data, data of their subsets of length-k partial sequences, and/or their position data may be handled collectively by a sequence data providing unit. Likewise, the comparison of positional relationships and the detection of polymorphism may be handled collectively by a sequence data calculation unit.
  • the method of the present invention may be implemented by a computing system having a cluster structure.
  • the system is in a cluster configuration, consisting of heads and nodes.
  • the node can use an SSD for the main storage to speed up the search.
  • a plurality of nodes (for example, 12) can be operated per head.
  • the computing system has a cluster structure and mounts a mass storage device (HDD) on a main computer (cluster head) to store analysis data and results. From the cluster head, the divided data is sent to each node and calculation is performed, and the result is aggregated into the cluster head.
  • Both the cluster head and the nodes can be equipped with a central processing unit (CPU) and memory (RAM), and data can be exchanged via a communication interface (NIC).
  • a solid state drive (SSD) can be used as a main storage device for high speed search processing of nodes.
  • the CPU, RAM, SSD and the like mounted on each node may be shared with other nodes or physically separated.
  • a process useful for detecting substitutions, copy number variation, STRP, insertions, deletions, inversions, or translocations includes the step of comparing the frequency of occurrence of each partial sequence in the subset of length-k partial sequences of the target sequence data with the frequency of occurrence of each partial sequence in the subset of length-k partial sequences of the control sequence data (where k is an integer no greater than the full length of the shorter of the target sequence and the control sequence), and detecting polymorphism based on the comparison of the distributions of occurrence frequency. With such a step, polymorphisms can be detected by comparing sequence data without considering positions in the full-length sequence and without assembling the sequences.
  • the process calculates, for each group of partial sequences that share the same sequence portion of length k-x (where x is a positive integer less than k), the distribution of the frequency of occurrence of the remaining portion of length x. The comparison of the distributions of occurrence frequency can then include comparing these distributions of the length-x portion among partial sequences that share the same length k-x portion.
  • the method of the present invention comprises the step of grouping the sequence portions of length k-x in the partial sequences into unique sequences. This may include, for example, sorting the length k-x sequence portions (e.g., sorting them as character strings).
  • the value of k is a length that excludes accidental identity in the subject sequence data or the like.
  • the length x is not limited, but is preferably 1 to 3, more preferably 1 to 2, and even more preferably 1.
  • the portion of length x is present at the end of the partial sequence.
  • the following polymorphisms can be detected by comparing the difference in the distribution of the appearance frequency.
  • when the appearance frequency of the sequence of the length-x portion differs between the subset of control sequence data and the subset of target sequence data, the sequence of the length-x portion is detected as a polymorphism in the target sequence data relative to the control sequence data.
  • if there is a sequence portion of length k-x for which the most frequent sequence of the length-x portion differs between the subset of control sequence data and the subset of target sequence data, the sequence of the length-x portion is detected as a polymorphism in the target sequence data.
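  • The comparison described above can be sketched as follows for x = 1 with the length-x portion at the end of each partial sequence. This is a minimal illustration, not the implementation of the invention: the function names, the min_count threshold and the in-memory dictionaries are assumptions made for the example.

```python
from collections import Counter, defaultdict

def last_base_table(reads, k):
    """Map each (k-1)-base prefix to a Counter of the base that follows it."""
    table = defaultdict(Counter)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            table[kmer[:-1]][kmer[-1]] += 1
    return table

def differing_sites(control_reads, target_reads, k, min_count=2):
    """Return (prefix, control_base, target_base) for every shared prefix
    whose most frequent final base differs between control and target.
    min_count is a hypothetical noise threshold, not taken from the text."""
    control = last_base_table(control_reads, k)
    target = last_base_table(target_reads, k)
    hits = []
    for prefix in sorted(control.keys() & target.keys()):
        c_base, c_n = control[prefix].most_common(1)[0]
        t_base, t_n = target[prefix].most_common(1)[0]
        if c_base != t_base and c_n >= min_count and t_n >= min_count:
            hits.append((prefix, c_base, t_base))
    return hits
```

  • For real data, k would be chosen long enough to exclude accidental identity, and the tables would typically be built by sorting on disk rather than held in memory.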
  • the process may further include the step of confirming the detected polymorphism.
  • the confirmation can be performed, for example, by comparing the target sequence data and / or the control sequence data, using the query sequence set generated from the reference sequence or the control sequence, for the detected polymorphic site.
  • the query sequence set may comprise a variant query sequence set, in which the character at the site corresponding to the polymorphism in the reference sequence or control sequence is replaced with a different character, and/or a wild-type query sequence set, in which that character is not replaced.
  • the process may further include the step of confirming the detected polymorphic site by comparing the sequence data of the complementary strand of the target sequence data and/or the control sequence data with a set of query sequences generated from the reference sequence or control sequence.
  • the method of the present invention may further include the step of confirming the detected polymorphic site by comparing the sequence data of the alleles of the target sequence data and/or the control sequence data with the set of query sequences generated from the reference sequence or control sequence.
  • a method is provided for detecting polymorphism in target sequence data relative to control sequence data, comprising: (1) a process of detecting substitution, copy number variation, STRP, insertion, deletion, inversion or translocation in the target sequence data by the steps of a) providing the frequency of occurrence of each partial sequence in a subset of the length-k partial sequences of the target sequence data; b) providing the frequency of occurrence of each partial sequence in a subset of the length-k partial sequences of the control sequence data; and c) comparing the target sequence with the control sequence and detecting polymorphism based on the comparison of the frequencies of occurrence; and (2) a process of detecting insertion, deletion, inversion, translocation or substitution in the target sequence data by the steps of a) identifying the positions on the control sequence of at least two partial sequences in the sequence of the target sequence data; b) comparing the positional relationship between the partial sequences in the target sequence data with the positional relationship between the partial sequences on the control sequence; and c) when the two positional relationships differ, comparing the characters between the partial sequence sites with the characters on the corresponding control sequence sequentially, starting from the partial sequence sites, and detecting the mismatched sites.
  • a method for detecting polymorphisms in reference sequence data in target sequence data comprising the step of creating, from the reference sequence data, a k-length partial array set of reference sequences associated with each position information, (A1) generating a subset of a subsequence of length k of the target sequence data, and providing a frequency of appearance of a unique subsequence of length k; (A2) providing the frequency of appearance of unique length k subsequences of the k-length subsequence set of the reference sequence; (A3) comparing the target sequence with the reference sequence, and detecting insertion, deletion, substitution, copy number variation, STRP, inversion or translocation based on the comparison of the distribution of the frequency of occurrence; (B1) a search is performed on the k-length subsequence set of the reference sequence by using at least two k-length subsequences in the sequence of the target sequence data as a query; Specifying the position on the
  • the method further includes the step of detecting a non-matching portion by comparing characters on the partial sequence site in the target sequence data with characters on the corresponding control sequence, when the positional relationship is not different,
  • a process further comprising the step of determining that a substitution exists if there is a non-coincidence site; Are provided simultaneously, in parallel, or sequentially.
  • the polymorphism detection of the present invention can be used for the detection of microsatellites.
  • the method of the present invention can also be used in detection of gene disruption in genome editing (eg, CRISPR / Cas9, ZFN, TALEN, etc.), detection of off-target alteration (eg, variation of SSR), and the like.
  • the method of the present invention can also be used to detect somatic mutations in cultured cells such as iPS cells and cancer cells, and is considered useful for controlling and/or monitoring mutations caused by excessive cell proliferation.
  • Example 1 Detection of polymorphism in rice reference genome
  • Example 2 Detection of polymorphism in rice reference genome
  • Polymorphism detection: From each short-read sequence obtained by applying sort_uniq processing to the sequence data in the 3M1 fastq data, k-mers were taken starting 5 bases in from each end, and a binary search was performed against the reference genome data to determine whether both k-mers hit unique positions. If one or both did not hit a unique position, the binary search was repeated with k-mers starting 10 bases in from each end. If one or both still did not hit a unique position, the binary search was repeated with k-mers starting 15 bases in from each end.
  • the corresponding reference sequences were retrieved from the position information of the upstream and downstream k-mer hits and placed with the upstream one above the short-read sequence and the downstream one below it, so that the k-mer portions coincided; these portions served as the starting points of the comparison. The short-read sequence was then compared base by base with the upper and lower reference sequences to find the mismatched bases.
  • for each mismatched base, the chromosome number and position were output as the boundary base of the inserted or deleted sequence.
  • in the figure, the first line, which begins with #, gives after the # the chromosome number and boundary (junction) position found when matching from the upstream side, then the chromosome number and junction position found when matching from the downstream side, and finally the size of the insertion or deletion.
  • the second line shows, from the left, the partial sequence (primer) used as the starting point for upstream matching, the partial sequence (primer) used for downstream matching, and finally the distance from the end of the next-generation sequencer read to the primer.
  • the reference sequences were placed above and below, aligned on the primer sequences, and the first non-matching position (the end point of the arrow) was found.
  • the position of the end point is a junction.
  • when the insertion/deletion site lies in a repetitive sequence, the upstream and downstream junctions overlap. In this example, a deletion of four units of the 2-base AT repeat (8 bases) was detected. (SEQ ID NOS: 5 to 8 from above)
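  • The junction search described above can be sketched as follows, assuming that the positions of the upstream and downstream k-mers are already known in both the read and the reference. The function name, the 0-based coordinates and the return format are assumptions made for this example; the sign convention (deletions negative) follows the Size column described in Example 2.

```python
def find_junctions(read, ref, up_read, up_ref, dn_read, dn_ref, k):
    """Locate insertion/deletion junctions from two anchored k-mers.

    up_read/up_ref: 0-based start of the upstream k-mer in read/reference.
    dn_read/dn_ref: 0-based start of the downstream k-mer in read/reference.
    Returns (upstream_junction, downstream_junction, size); the junctions are
    0-based reference indices of the first mismatching base in each direction.
    """
    # Change in anchor spacing: positive = insertion, negative = deletion.
    size = (dn_read - up_read) - (dn_ref - up_ref)

    # Extend rightwards from the end of the upstream k-mer.
    i, j = up_read + k, up_ref + k
    while i < len(read) and j < len(ref) and read[i] == ref[j]:
        i += 1
        j += 1
    up_junction = j

    # Extend leftwards from the start of the downstream k-mer.
    i, j = dn_read - 1, dn_ref - 1
    while i >= 0 and j >= 0 and read[i] == ref[j]:
        i -= 1
        j -= 1
    dn_junction = j

    return up_junction, dn_junction, size
```

  • With the reference TTGACC + ATATATAT + GGCTAA and a read that lacks two AT units, the call find_junctions(read, ref, 0, 0, 12, 16, 4) reports a size of -4, and the upstream junction lies past the downstream junction, which is the overlap expected when the deletion falls inside a repeat.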
  • a 280 kb deletion was detected that spans from 23388732 to 23668838 on chromosome 8.
  • the position on the genome of each underlined 20-base sequence is determined as follows: 20-base sequences are cut out of the reference genome while shifting by 1 base from the end; each sequence is written on one line together with the chromosome number, position and orientation from which it was cut; the resulting data set is sorted in lexicographic order of the 20-base sequences; and the chromosome number, position and orientation are retrieved by a binary search algorithm. (SEQ ID NOS: 9 to 12 from above)
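  • A small in-memory sketch of such a sorted index and its binary search is shown below. The tuple layout, the "f"/"r" orientation labels, the 1-based positions and the function names are assumptions made for illustration; a genome-scale index would be kept in a sorted file rather than a Python list.

```python
import bisect

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def build_index(chromosomes, k=20):
    """Return a sorted list of (kmer, chromosome, position, orientation).

    chromosomes: dict mapping chromosome name to sequence.
    Positions are 1-based; "f" is the forward strand and "r" the reverse
    complement (both labels are assumptions of this sketch).
    """
    rows = []
    for chrom, seq in chromosomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            rows.append((kmer, chrom, i + 1, "f"))
            rows.append((kmer.translate(COMPLEMENT)[::-1], chrom, i + 1, "r"))
    rows.sort()  # lexicographic order of the k-mer comes first
    return rows

def lookup_unique(index, kmer):
    """Binary-search the index; return (chromosome, position, orientation)
    only when the k-mer occurs exactly once, otherwise None."""
    lo = bisect.bisect_left(index, (kmer,))
    if lo == len(index) or index[lo][0] != kmer:
        return None  # no hit
    if lo + 1 < len(index) and index[lo + 1][0] == kmer:
        return None  # more than one hit, so the position is ambiguous
    return index[lo][1:]
```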
  • Example 2: Evaluation of Polymorphism Detection Performance. To evaluate the polymorphism detection performance of the method of the present invention, polymorphism detection was performed using, as target sequence data, rice genome sequence data that mimicked data from a next-generation sequencer and was generated by introducing mutations into the rice reference sequence (IRGSP1.0).
  • as the target sequence, a reference sequence was used into which a 1-base deletion at 3 Mbp from the upstream end, a 1-base insertion at 6 Mbp from the upstream end, and a 100 kb deletion at 9 Mbp from the upstream end had been introduced on each of the 12 rice chromosomes.
  • a single-base substitution was introduced every 10 Mb from the upstream end of each chromosome.
  • the sequence set was obtained by cutting out 100-base sequences at every second position on the genome (equivalent to 50-fold genome coverage), mimicking sequence data from a next-generation sequencer.
  • substitution noise was introduced with a probability of 0.1%.
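  • The read simulation described above can be sketched as follows. The function name, the seeded random generator and the uniform choice of the substituted base are assumptions of this example; the read length, step and error rate defaults follow the values stated in the text.

```python
import random

def simulate_reads(genome, read_len=100, step=2, error_rate=0.001, seed=0):
    """Cut read_len-base reads at every `step`-th position and add
    substitution noise with probability `error_rate` per base.

    Coverage is roughly read_len / step (50-fold for the defaults).
    """
    rng = random.Random(seed)
    reads = []
    for start in range(0, len(genome) - read_len + 1, step):
        bases = list(genome[start:start + read_len])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                # Replace with one of the other three bases.
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(bases))
    return reads
```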
  • k-mers were taken starting 5 bases in from each end of every short-read sequence that had undergone sort_uniq processing, and a binary search was performed against the reference genome data to determine whether both k-mers hit unique positions. If one or both did not hit a unique position, the binary search was repeated with k-mers starting 10 bases in from each end. If one or both still did not hit a unique position, the binary search was repeated with k-mers starting 15 bases in from each end. If one or both again failed to hit a unique position, analysis of that short-read sequence was abandoned and the same search was performed for the next short-read sequence.
  • the corresponding reference sequences were retrieved from the position information of the upstream and downstream k-mer hits and placed with the upstream one above the short-read sequence and the downstream one below it, so that the k-mer portions coincided; these portions served as the starting points of the comparison. The short-read sequence was then compared base by base with the upper and lower reference sequences to find the mismatched bases.
  • Chr is a chromosome number
  • Top is the insertion/deletion junction on the top strand (the 5'-to-3' sequence of the base sequence)
  • Bottom is the insertion/deletion junction on the bottom strand (complementary strand)
  • Size is the size of the insertion/deletion (deletions are written as negative values)
  • Reads is the number of next generation sequencer reads detected at similar positions and sizes.
  • Example 3 Further examination of algorithm, comparison with conventional method
  • Step 0 was performed only once in preparation, and steps 1 to 5 were performed for each sample.
  • Samtools detected 19 of the mutations introduced into the reference genome. The method devised here detected 22. A comparison of the detection results of Samtools and the method devised here is shown in FIG. The results with Samtools are shown in FIG.
  • The false detection on chromosome 4 seen in Example 2 did not occur here. This is thought to be because, whereas in Example 2 a mismatch was reported even at the base immediately following the matched k-mer, here only cases with a perfect match up to 5 bases beyond that position were picked up.
  • FIG. 1 A comparison table summarizing the detection results of the method of the present invention and the method using Samtools is shown in FIG.
  • Samtools failed to detect the mutation at position 900001 on every chromosome, i.e., the 100 kb deletion. For deletions exceeding the read length, detection by the conventional approach using bwa and samtools is considered impossible in principle.
  • next-generation sequencing data for the human genome NA18507 were downloaded and used.
  • the sequence data was analyzed by a next-generation sequencer manufactured by Illumina, and registered and published in NCBI. The data was downloaded and used.
  • the URL of the experiment ID of the base sequence set was https://www.ncbi.nlm.nih.gov/sra/SRX016231 and the accession number of the sequence was in the range of SRR034939 to SRR034975.
  • as for inversions and translocations, cells would probably not be viable if these occurred at such a frequency, so they may be artifacts of sample preparation. However, since this DNA sample appears to have been obtained from cultured cells, they may actually have arisen during long-term culture.
  • Origin of sequence data: There were two samples; the sequence data names and sample contents were as follows.
  • SRR2096532: control blood (normal); SRR2096535: follicular lymphoma (9690/3: Follicular lymphoma)
  • Number of reads (sequence length 101 bases): SRR2096532, 1300353764; SRR2096535, 1339310760
  • Number of sequences after sort_uniq: SRR2096532, 2056683322
  • SRR2096532: normal tissue
  • SRR2096535: tumor tissue
  • human genome reference hg38 was used as control sequence data.
  • as the sequences, the chromosome data for chr1 to chr22, chrX, chrY and chrM were downloaded from ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/ and used. Files whose names carry annotations such as alt or v1 were excluded.
  • one feature of the method is that the detection result itself can be checked visually as an alignment.
  • Example A1: Display Method for Polymorphism Detection (Overview). This example demonstrates that, using the method of the present invention, control sequences matching both ends of the target sequence data (or the portions adjacent to them) can be placed above and below the target sequence data, and the positions where the target sequence and the control sequences do not match can be displayed. It also shows that such a display is useful for detecting polymorphisms.
  • the sequence of the mother (ERR194147) of the CEPH 1463 family, which is next-generation sequencing data available in a database, was used.
  • a set of sequences of 100 bases in length obtained from the rice reference sequence into which mutations were introduced was used as a set of short read sequences.
  • k-mers were taken starting 5 bases in from each end of every short-read sequence that had undergone sort_uniq processing and were mapped to the reference genome data (for details, see Example B1), and it was determined whether both k-mers hit unique positions.
  • the reference genome sequence is displayed above the target sequence data, aligned on the position of the 5'-side k-mer of the short-read sequence, and is displayed below the target sequence data, aligned on the position of the 3'-side k-mer of the short-read sequence.
  • this display is then output.
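  • A three-line display of this kind might be produced as in the sketch below. The function name, the 0-based anchor coordinates, and the convention of printing mismatching reference bases in lower case are assumptions made for illustration; the figures of the source use arrows rather than lower case.

```python
def render_alignment(read, ref, up_read, up_ref, dn_read, dn_ref):
    """Return three lines: the reference aligned on the 5'-side k-mer,
    the read, and the reference aligned on the 3'-side k-mer.

    up_read/up_ref and dn_read/dn_ref are the 0-based start positions of
    the 5'-side and 3'-side k-mers in the read and in the reference.
    Reference bases that disagree with the read are shown in lower case.
    """
    def aligned(offset):
        chars = []
        for i in range(len(read)):
            j = i + offset
            chars.append(ref[j] if 0 <= j < len(ref) else " ")
        return "".join(chars)

    def mark(line):
        return "".join(c if c == r else c.lower() for c, r in zip(line, read))

    top = mark(aligned(up_ref - up_read))
    bottom = mark(aligned(dn_ref - dn_read))
    return "\n".join([top, read, bottom])
```

  • Reading the top line from the left and the bottom line from the right, the first lower-case base on each line marks a junction.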
  • single base insertions and deletions can be detected in the poly A region where mutations are difficult to detect.
  • mutations that are usually difficult to detect can also be detected visually.
  • in the conventional method, for example, the last A of a poly-A run is reported as deleted, but in reality it is not known which of the many A's was deleted.
  • the deletion occurred somewhere within the junction region, although it is not known which A was deleted.
  • representing a mutation by showing the aligned junction positions on both the forward and complementary strands is novel and useful as a way of representing mutations in its own right. A representation that gives only the junction positions (the line beginning with # in the above example), without necessarily showing the alignment, is also useful.
  • Example B1 Mapping Method (Overview)
  • a method (mapping method) for rapidly determining the position on the genome of both ends of the target sequence or the vicinity thereof is demonstrated.
  • the output data, consisting of partial sequences and their position information, are sorted in lexicographic order. If the same partial sequence appears on multiple lines, it occurs more than once in the reference sequence and cannot be assigned to a single position, so it is discarded; in this way, reference partial sequence position information data consisting only of unique sequences are created. As an example, the data are arranged as follows: reference genome partial sequence data (excerpt) arranged in lexicographic order (SEQ ID NOS: 40 to 50, respectively, from the top)
  • an arbitrary 20-base stretch is taken from the target sequence and from its complementary-strand sequence data, and the 20 bases, the target sequence, and the start position of the 20 bases within the target sequence are output on one line in that order.
  • the sorted target sequence data and the reference partial sequence position information data are read together, and when the 20-base partial sequences match, the two records are combined and output on one line.
  • the position of the target sequence on the genome can be determined from the genomic position of the reference partial sequence and the start position, within the target sequence, of the 20 bases taken from it.
  • with the reference partial sequence position information data saved as the file reference and the target sequence data saved as the file target, the result is obtained by running the Unix command: join reference target
  • in the target file, the 20 bases starting at the sixth base of the next-generation sequencer read (second column) are placed in the first column, and the third column records that they were cut out starting at the sixth base.
  • the chromosome number, position, and orientation are attached in the fourth and subsequent columns, so in this case the genomic position of the sixth base of the read is known.
  • Insertion, deletion, translocation, inversion, and substitution mutations can be detected by using this result to detect inconsistencies from each position in comparison with the reference genome.
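  • The join step can be imitated in a few lines, as in the sketch below. It assumes both inputs are already sorted on the 20-base key and that reference keys are unique; the function names and tuple layouts are assumptions of this example, and the output column order follows the description of the result (key, read, start position, then chromosome, position, orientation). Python compares strings by code point, which corresponds to running sort and join under the C locale.

```python
def merge_join(reference_rows, target_rows):
    """Merge-join two key-sorted tables on their first field.

    reference_rows: (kmer, chromosome, position, orientation), unique keys.
    target_rows:    (kmer, read, start), where start is the 1-based
                    position of the k-mer within the read.
    """
    joined = []
    i = 0
    for kmer, read, start in target_rows:
        # Advance the reference cursor until it reaches or passes the key.
        while i < len(reference_rows) and reference_rows[i][0] < kmer:
            i += 1
        if i < len(reference_rows) and reference_rows[i][0] == kmer:
            chrom, pos, strand = reference_rows[i][1:]
            joined.append((kmer, read, start, chrom, pos, strand))
    return joined

def read_start_on_genome(row):
    """Genomic position of the first base of the read (forward hits only)."""
    _, _, start, chrom, pos, strand = row
    if strand != "f":
        return None  # reverse-strand arithmetic is left out of this sketch
    return chrom, pos - (start - 1)
```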
  • the hardware used for the benchmark had an Intel Celeron G1840 CPU @ 2.80 GHz, 8 GB of RAM, and a 1 TB SSD for the working directory.
  • an additional HDD was used for the primary directory.
  • the time to map all sort_uniq data (2,449,630,776 reads) of ERR194147 is shown.
  • 10 reads and 10,000,000 reads, respectively, were analyzed to estimate the overall time.
  • For bwa the first 10,000,000 reads from paired fastq files were used to estimate the total time.
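The mapping steps of Example B1 (keep only reference 20-mers that occur once, take 20 bases from a fixed offset of each read, then join the two on the 20-mer) can be sketched in Python as follows. This is a minimal in-memory illustration, not the procedure of the specification: a dictionary lookup stands in for the sorted files and the Unix join command, only the forward strand of the reference is indexed, and the function names `build_unique_index` and `map_read` are hypothetical.

```python
from collections import Counter

K = 20  # partial-sequence length used in Example B1


def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def build_unique_index(reference, k=K):
    """Map each k-mer that occurs exactly once to (chromosome, 1-based position).

    `reference` is a dict of chromosome name -> sequence (an assumption made
    for this sketch). K-mers seen more than once are discarded because they
    cannot be assigned to a single position.
    """
    counts = Counter()
    first_seen = {}
    for chrom, seq in reference.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            counts[kmer] += 1
            first_seen.setdefault(kmer, (chrom, i + 1))
    return {kmer: loc for kmer, loc in first_seen.items() if counts[kmer] == 1}


def map_read(read, index, offset=6, k=K):
    """Look up the k-mer starting at 1-based `offset` of the read, then of its
    reverse complement. Return (chromosome, k-mer position, strand) or None."""
    for strand, seq in (("+", read), ("-", revcomp(read))):
        kmer = seq[offset - 1:offset - 1 + k]
        hit = index.get(kmer)
        if len(kmer) == k and hit is not None:
            chrom, pos = hit
            return chrom, pos, strand
    return None
```

For a hit on the "+" strand, the first base of the read lies at the returned position minus (offset - 1), which is the position arithmetic described in the text.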
  • Example B2 Example of Variation of Mapping Method (Overview)
  • an example of a variation of the method (mapping method) of rapidly determining the positions on the genome of both ends of the target sequence or the vicinity thereof is demonstrated.
  • a 19-mer sequence is cut out from each site of the genome reference sequence, and output in one line in the order of 19-mer sequence, chromosome number, position and direction, and the file ref sorted in alphabetical order is used for mapping. (SEQ ID NOS: 81 to 90 respectively from above)
  • a sequence of 20 bases in length was cut out from each site of read data of 100 bases each, and a 20 base sequence was repeatedly output (k-mer_file) until reaching the 3 'end of the target base sequence.
  • the outputted 20-base sequences were sorted in dictionary order, and the same sequences were put together into one, and a file in which the number of occurrences was written together with the sequences was created.
  • The first 19 bases from the 5' end of each 20-base sequence were taken, and the 3' end base, i.e., the k-th base, was converted into counts of A, C, G, and T. The output has the form "19-base sequence, count of A, count of C, count of G, count of T".
  • As in Example B1, the position on the reference sequence of each 19-base sequence of the target sequence was derived from the 19-base data of the reference sequence and the 19-base data of the target sequence.
  • After the sequence, the chromosome number, position, and orientation are output, followed by the 20th base of the reference sequence and the 20th-base frequencies of the target sequence.
  • AAAGCAAATTTATTTGTTT starts from 144844205 of chromosome 2
  • The polymorphism of the target sequence is a heterozygous G/T polymorphism.
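The counting step of Example B2 (cut every 20-mer from the reads, count identical 20-mers, then regroup by the 19-base prefix with one count per possible 20th base) can be sketched as below. This is an illustrative sketch only: the function names and the `min_fraction` threshold are hypothetical and do not come from the specification, and the read counts in the usage note are made up.

```python
from collections import Counter, defaultdict


def last_base_profile(reads, k=20):
    """For every (k-1)-base prefix, count how often A, C, G, T follow it."""
    kmer_counts = Counter()
    for read in reads:
        # Slide a k-base window from the 5' end until the 3' end is reached.
        for i in range(len(read) - k + 1):
            kmer_counts[read[i:i + k]] += 1

    profile = defaultdict(lambda: dict.fromkeys("ACGT", 0))
    for kmer, n in kmer_counts.items():
        if kmer[-1] in "ACGT":  # skip ambiguous calls such as N
            profile[kmer[:-1]][kmer[-1]] += n
    return dict(profile)


def call_bases(counts, min_fraction=0.25):
    """Return the bases whose share of the counts is at least `min_fraction`.

    The threshold is an assumption chosen for this sketch.
    """
    total = sum(counts.values())
    if total == 0:
        return []
    return [base for base in "ACGT" if counts[base] / total >= min_fraction]
```

With toy input in which the 19-mer AAAGCAAATTTATTTGTTT is followed by G in two reads and by T in two reads, `last_base_profile` gives counts of 0, 0, 2, 2 for A, C, G, T, and `call_bases` reports both G and T, the pattern expected for a heterozygous site.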
  • Example C1 Method of Confirming Mutation (Overview) This example demonstrates an example of how to confirm the presence of a mutation detected by the methods described elsewhere herein.
  • ERR194147 was used as the target sequence data.
  • a G to T mutation at a site 916010 of chromosome 1 is detected by the bidirectional alignment method (Example A1).
  • The target sequence and its complementary strand sequence data are sorted in dictionary order, with identical rows output only once, giving a sorted file of unique rows (sort_uniq file). This file and the sorted map data file of mutant and wild-type sequences are compared in order, and only the data for which the target sequence exists are output. This can be done with the Unix command join data map.
  • For each position, the number of wild-type and mutant-type data detected for the reference sequence and for the target sequence, respectively, is summarized.
  • The wild type is the majority in the reference sequence.
  • In the case of a homozygous mutation, the mutant type is the majority in the target sequence.
  • In the case of a heterozygous mutation, the wild type and the mutant type each account for about half in the target sequence.
  • If a set of reference genome sequences of the same length as the target sequence, including the sequence between the upstream and downstream junctions in the notation of Example A1, is prepared, the same confirmation can be made by performing the mapping operations of steps 1 to 3.
  • The underlined base is the target base. A sequence extracted from the reference with the same length as the target sequence (target), the chromosome number, position, wild-type base, mutant-type base, wild-type tw, and mutant-type tm are output.
  • the target base of the excised sequence is a mutant type. (Part of the data set for confirmation of G to T mutation at the 916010 site of chromosome 1) (SEQ ID NOS: 98 to 109 from the top, respectively)
  • The values of chromosome number, position, wild-type base, mutant base, and tw (wild type) or tm (mutant) were extracted from the data output in the previous step, and the number of occurrences (left end) was examined. The number of occurrences was obtained by executing the Unix command uniq -c after sorting the data.
  • The columns are chromosome number, position, reference base, mutant base, number of wild-type detections for the reference sequence (reference), number of mutant detections for the reference sequence, number of wild-type detections for the target sequence, and number of mutant detections for the target sequence.
  • When the reference sequence is mostly wild type, heterotype (H) is displayed at the right end if the wild type and the mutant each make up about half of the target sequence, and homotype (M) if the mutant is the majority.
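The tallying and labelling described for Example C1 can be sketched as follows. The specification only says "mostly" and "half", so the numeric cut-offs (`majority`, `het_low`, `het_high`) and the function names are assumptions made for this illustration; `tally` is an in-memory stand-in for sorting followed by uniq -c.

```python
from collections import Counter


def tally(records):
    """Count identical records, as sort followed by uniq -c would."""
    return Counter(records)


def classify(ref_wt, ref_mut, tgt_wt, tgt_mut,
             majority=0.8, het_low=0.3, het_high=0.7):
    """Label a site 'H' (heterotype), 'M' (homotype), or None (no call).

    Inputs are detection counts of wild-type and mutant sequences for the
    reference and for the target. All thresholds are hypothetical.
    """
    ref_total = ref_wt + ref_mut
    tgt_total = tgt_wt + tgt_mut
    if ref_total == 0 or tgt_total == 0:
        return None
    if ref_wt / ref_total < majority:
        return None  # the reference is not clearly wild type
    mutant_fraction = tgt_mut / tgt_total
    if mutant_fraction >= majority:
        return "M"
    if het_low <= mutant_fraction <= het_high:
        return "H"
    return None
```

A site where the reference is all wild type and the target is split roughly evenly is labelled H; one where nearly all target sequences carry the mutant base is labelled M.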
  • SEQ ID NOs: 1 to 16: Rice base sequences used in Example 1
  • SEQ ID NOs: 17 to 24: Human base sequences used in Example 5
  • SEQ ID NOs: 25 to 39: Human base sequences used in Example A1
  • SEQ ID NOs: 40 to 80: Human base sequences used in Example B1
  • SEQ ID NOs: 81 to 97: Human base sequences used in Example B2
  • SEQ ID NOs: 98 to 118: Human base sequences used in Example C1

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Organic Chemistry (AREA)
  • Wood Science & Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Zoology (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • General Engineering & Computer Science (AREA)
  • Biochemistry (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Microbiology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Physics & Mathematics (AREA)
  • Immunology (AREA)
  • Analytical Chemistry (AREA)
  • Medicinal Chemistry (AREA)
  • Sustainable Development (AREA)
  • Plant Pathology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

The present invention relates to a method for detecting polymorphism between two or more sequences. A feature of this method is that polymorphism can be detected without requiring longer sequences to be formed (for example, by assembly) by joining individual sequences (for example, short reads from a next-generation sequencer) in the sequence data. In one embodiment, the method is characterized in that, starting from a partial sequence in a target sequence that matches a sequence (for example, a reference genome), the comparison between the target sequence and the reference is extended until a mismatch occurs, and in this way a mutation junction is determined.
PCT/JP2018/027536 2017-07-24 2018-07-23 Méthode de détection d'insertion, de délétion, d'inversion, de translocation et de substitution WO2019022019A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2019532604A JP7122006B2 (ja) 2017-07-24 2018-07-23 挿入・欠失・逆位・転座・置換検出法

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2017-142782 2017-07-24
JP2017142782 2017-07-24

Publications (1)

Publication Number Publication Date
WO2019022019A1 true WO2019022019A1 (fr) 2019-01-31

Family

ID=65039676

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2018/027536 WO2019022019A1 (fr) 2017-07-24 2018-07-23 Méthode de détection d'insertion, de délétion, d'inversion, de translocation et de substitution

Country Status (3)

Country Link
JP (1) JP7122006B2 (fr)
TW (1) TW201921277A (fr)
WO (1) WO2019022019A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009057757A1 (fr) * 2007-10-31 2009-05-07 National Institute Of Agrobiological Sciences Programme de détermination de séquence de bases, système de détermination de séquence de bases et procédé de détermination de séquence de bases
WO2014132497A1 (fr) * 2013-02-28 2014-09-04 株式会社日立ハイテクノロジーズ Dispositif et procédé d'analyse de données
JP2016103999A (ja) * 2014-11-05 2016-06-09 アジレント・テクノロジーズ・インクAgilent Technologies, Inc. ゲノム位置に標的濃縮配列リードを割り当てるための方法

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
IIDA, KEIKO ET AL.: "Rice species-specific polymorphism detection", ABSTRACTS OF CONSORTIUM OF BIOLOGICAL SCIENCES 2017, 15 November 2017 (2017-11-15) *
ISHII, KAZUO ET AL.: "1. Identification of structural mutation, genome information analysis - latest methods and applications of next generation sequence", ALGORITHM TECHNIQUES FOR STRUCTURAL MUTATION TYPE-DETECTION, 18 March 2016 (2016-03-18), pages 2 - 25 *
MIYAO, AKIO ET AL.: "Segment analysis of rice species by comparion of genome-wide SNP map", THE 39TH ANNUAL MEETING OF THE MOLECULAR BIOLOGY SOCIETY OF JAPAN, 16 November 2016 (2016-11-16) *

Also Published As

Publication number Publication date
JPWO2019022019A1 (ja) 2020-05-28
JP7122006B2 (ja) 2022-08-19
TW201921277A (zh) 2019-06-01

Similar Documents

Publication Publication Date Title
Kim et al. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype
Minnoye et al. Chromatin accessibility profiling methods
Goerner-Potvin et al. Computational tools to unmask transposable elements
US20180225416A1 (en) Systems and methods for visualizing a pattern in a dataset
AU2023251541A1 (en) Deep learning-based variant classifier
US20060286566A1 (en) Detecting apparent mutations in nucleic acid sequences
Merkel et al. Detecting short tandem repeats from genome data: opening the software black box
CN110997936A (zh) 基于低深度基因组测序进行基因分型的方法、装置及其用途
Goswami et al. RNA-Seq for revealing the function of the transcriptome
US20220375544A1 (en) Kit and method of using kit
JP7166638B2 (ja) 多型検出法
JP7122006B2 (ja) 挿入・欠失・逆位・転座・置換検出法
Peretz-Machluf et al. Genome-wide noninvasive prenatal diagnosis of de novo mutations
Seol et al. A multilayered screening method for the identification of regulatory genes in rice by agronomic traits
Marić et al. Approaches to metagenomic classification and assembly
WO2017136606A1 (fr) Appareils, systèmes, et procédés destinés à l'amplification d'adn avec filtrage des données post-séquençage et isolement des cellules
Barrabés et al. Genomic Databases Homogenization with Machine Learning
Sánchez Practical Transcriptomics: Differential gene expression applied to food production
Attimonelli et al. Bioinformatics resources, databases, and tools for human mtDNA
JP2021104016A (ja) 転移因子検出法
Toivonen et al. Data mining for gene mapping
Chuan Next-generation sequencing and bioinformatics analysis of plant pathogenic bacteria causing potato zebra chips and ring rot
Barreira et al. AniProtDB: A Collection of Uniformly Generated Metazoan Proteomes for Comparative Genomics Studies
Mansueto et al. CannSeek? Yes we Can! An open-source SNP database and analysis portal for Cannabis sativa
Delahaye Haplotype phasing from long reads with ASP: a flexible optimization approach

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18837423

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2019532604

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18837423

Country of ref document: EP

Kind code of ref document: A1