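WO2015151758A1 - Sequence data analysis apparatus, DNA analysis system, and sequence data analysis method - Google Patents

Sequence data analysis apparatus, DNA analysis system, and sequence data analysis method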

Info

Publication number
WO2015151758A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
character
sample
character string
dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2015/057348
Other languages
English (en)
French (fr)
Japanese (ja)
Inventor
木村 宏一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hitachi High Tech Corp
Original Assignee
Hitachi High Technologies Corp
Hitachi High Tech Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hitachi High Technologies Corp, Hitachi High Tech Corp
Priority to US15/301,086 (patent US10810239B2)
Priority to GB1616668.8A (patent GB2539596B)
Priority to CN201580014840.6A (patent CN106104541B)
Priority to DE112015001637.6T (patent DE112015001637T5)
Publication of WO2015151758A1

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Definitions

  • the present invention relates to a sequence data analysis apparatus, a DNA analysis system, and a sequence data analysis method.
  • Genomic DNA has its base sequence already sequenced (read) throughout the DNA, and its base character string is made public on a server on the Internet.
  • Using the genomic DNA as a model (reference data), a researcher detects mutations in sample DNA by mapping the subject's sample DNA fragments, read by a sequencer device, to base positions in the genomic DNA (genome mapping).
  • A mutation is a difference between the base character string of the genomic DNA and the base character string of a sample DNA fragment, for example a single nucleotide polymorphism (SNP) or a structural variation (SV).
  • Because the sequencer device can read only a limited length of DNA sequence, a read sequence of limited length (only part of the sample DNA fragment) is obtained from each end of each sample DNA fragment, the fragments having been prepared to a substantially constant length.
  • A paired-end method that handles these two (paired) read sequences is known. In the paired-end method, a section that is not sequenced and belongs to neither read sequence of the pair remains at the center of the sample DNA fragment. Therefore, when a pair of read sequences obtained by the paired-end method is mapped to base positions in the genomic DNA, the position of the other read sequence is checked in addition to the position of one read sequence of the pair. By collating both positions, one sample DNA fragment can be mapped to the genomic DNA with high accuracy.
  • In genome-wide transcriptome profiling, a method is known in which paired-end ditags that preserve the order from the 5' end to the 3' end on the full-length cDNA are sequenced and the ditag sequences adjacent to a specified spacer sequence are detected efficiently (Patent Document 1).
  • To align a large number of short reads obtained from a new-generation DNA sequencer with a reference genome sequence at high speed and with high accuracy, a widely used technique creates data by applying the BW (Burrows-Wheeler) conversion to the genome sequence and uses that data to rapidly identify the positions in the reference genome where a sequence matching a seed sequence of several tens of bases at the beginning of a read appears (for example, Non-Patent Document 1).
  • Mapping of paired short reads is also widely performed (for example, Non-Patent Document 1).
  • When two short reads form a pair, they are given the same name (e.g., QNAME in Non-Patent Document 1).
  • A method that uses the large amount of data obtained by mapping many short reads to a reference genome is widely used (for example, Non-Patent Document 3).
  • An approach that uses the BW conversion of a reference genome sequence together with the BW conversion of short read sequences can also be considered (Patent Document 2, Non-Patent Document 4).
  • Patent Document 1: JP 2008-547080 A. Patent Document 2: Japanese Patent Application No. 2013-038919.
  • To handle paired read sequences, an identifier (or pointer information) for associating one read sequence with the other read sequence of the pair is necessary. Since these identifiers must uniquely identify the partner of the pair, they require a large number of bits. The data size of the identifiers therefore grows, which reduces calculation efficiency. For example, if there are 8 billion short reads with a length of 100 bases, 4 billion identifiers are required, and at least 4 bytes are required per identifier. The size of the identifier data for all reads reaches 4 GB (gigabytes).
  • The main object of the present invention is therefore to efficiently obtain the other read sequence from one read sequence of a pair sequenced from both ends of a sample DNA fragment.
  • The sequence data analysis apparatus of the present invention provides: a read dictionary creation unit that creates, for each sample DNA fragment, a combined character string in which a left sequence and a right sequence, which form a pair sequenced from both ends of the sample DNA fragment, are connected by a joining character, and that creates a read sequence dictionary based on a character string obtained by appending a terminal character to each combined character string; a query search unit that searches the read sequence dictionary for a hit position, which is the base character coordinate at which a query sequence generated from a base character string of genomic DNA appears; a sample restoration unit that, starting from the hit position in the read sequence dictionary, takes as a sample sequence the character string extending in both directions until a terminal character appears, scans the sample sequence from the hit position for the joining character, and extracts as a mate sequence the left sequence or the right sequence lying between the joining character found and the terminal character on the side where the hit position does not exist; and a mapping unit that searches for base character coordinates of a base character string in the genomic DNA in which the mate
  • According to the present invention, the other read sequence can be obtained efficiently from one read sequence of a pair sequenced from both ends of a sample DNA fragment.
  • FIG. 4A shows a process for creating a combined character string from the read sequences.
  • FIG. 4B shows a process for creating a BW character string from the combined character string. FIG. 5 is an explanatory diagram showing the conversion process to the Wavelet Tree format.
  • FIG. 5A shows a binary tree referred to for conversion to the Wavelet Tree format.
  • FIG. 5B shows processing for converting a BW character string into a Wavelet Tree format.
  • FIG. 7A shows SNP information.
  • FIG. 7B shows a query sequence created from the SNP information and the genome sequence dictionary.
  • FIG. 7C shows a state in which a query sequence is searched for in the read sequence dictionary.
  • FIG. 7D shows the SNP analysis result. FIG. 8 is a flow diagram showing the structural variation analysis process according to one embodiment of the present invention.
  • FIG. 9A shows structural variation information.
  • FIG. 9B shows a query sequence created from the structural variation information and the genome sequence dictionary.
  • FIG. 9C shows a state where a query sequence is searched from the read sequence dictionary.
  • FIG. 9D shows the analysis result of the structural variation.
  • FIG. 1 is a configuration diagram showing a DNA analysis system.
  • The sequence data analysis apparatus 1 is realized by a computer, such as a server, having an ordinary computer configuration.
  • In the sequence data analysis apparatus 1, the following are connected to a bus 207: a central processing unit (CPU) 201; a memory 202, which is a storage unit for storing programs; a display unit 203 that displays a GUI (Graphical User Interface) for operation, analysis results, and the like; a hard disk drive (HDD) 204 that functions as a storage unit for the sequence dictionaries (the read sequence dictionary 14 and the genome sequence dictionary 15 in FIG. 2); an input unit 205, such as a keyboard, for entering mutation information, parameters, and the like; and a network interface (NIF) 206 for connecting to the Internet.
  • The sequence dictionaries stored in the HDD 204 may instead be stored in a storage device installed outside the sequence data analysis apparatus 1, or in a data center or the like reached via a network.
  • Various flowcharts described below are realized by program execution of the CPU 201 and the like.
  • the genome server 8 and the sequencer 9 are connected to the NIF 206 of the sequence data analysis apparatus 1 via a network.
  • The sequencer 9 sequences (reads) a pair of both ends (a 5'-end read sequence and a 3'-end read sequence) for each sample DNA fragment and sends the result to the sequence data analysis apparatus 1.
  • As a notation for a read sequence (base sequence), it is common to write the base character on the 5' end side on the left and the base character on the 3' end side on the right. In the following, the 5' end side is therefore called “left” and the 3' end side is called “right”.
  • the sequencer 9 is configured as a massively parallel type (so-called next generation type) DNA sequencer, and can sequence a large number (for example, 100 million pieces) of sample DNA fragments in parallel.
  • the genome server 8 provides the sequence data analysis apparatus 1 with a genome sequence that is a result of sequencing the genomic DNA.
  • Paired-end sequencing data of a cDNA sample may also be made the target of analysis.
  • In that case, splice variants can be detected. Loss of an exon due to a splice variant corresponds to a “deletion” structural mutation, and incorporation of a new exon corresponds to an “insertion” structural mutation.
  • FIG. 2 is a configuration diagram showing the sequence data analysis apparatus 1.
  • The sequence data analysis apparatus 1 receives input of a read sequence set 11 (left sequence 11a and right sequence 11b) from the sequencer 9.
  • The read sequence set 11 is a set of read sequences (left sequence 11a and right sequence 11b) for each sample DNA fragment sequenced by the sequencer 9.
  • The left sequence 11a is a read sequence that is sequenced toward the 3' end from the 5' end of the sample DNA fragment.
  • The right sequence 11b is a read sequence that is sequenced toward the 5' end from the 3' end of the sample DNA fragment.
  • The central 100 bases lie to the right of the left sequence 11a, are not included in the right sequence 11b either, and are not subject to sequencing.
  • The portion that is not sequenced is about 19,800 bases.
  • The read dictionary creation unit 21 creates a read sequence dictionary 14 from the read sequence set 11.
  • the genome dictionary creation unit 22 creates a genome sequence dictionary 15 from the reference genome sequence 12.
  • the reference genome sequence 12 is determined for each species to be analyzed, and is a set of sequences of the full length of each chromosome.
  • the sequence data analyzing apparatus 1 accepts input of SNP information 13a or structural mutation information 13b as analysis information 13 indicating a mutation to be analyzed this time.
  • An SNP is a position at which the base of the sample DNA fragment differs from the base at the same position in the genomic DNA.
  • a structural mutation is an insertion or deletion of a sequence of a plurality of consecutive bases.
  • the query creation unit 23 creates a query sequence 16 including the mutation indicated by the analysis information 13 with reference to the genome sequence dictionary 15.
  • The query search unit 24 searches the read sequence dictionary 14 for a hit position 17a, which is a base position coordinate at which the query sequence 16 is mapped (where the query sequence 16 appears).
  • The sample restoration unit 25 restores the read sequence (sample sequence 17) of the sample DNA fragment that includes the hit position 17a.
  • Since the hit position 17a is included in either the left sequence 11a or the right sequence 11b, the read sequence that does not include the hit position 17a is referred to as the mate sequence 17b.
  • The mapping unit 26 refers to the genome sequence dictionary 15 and identifies the position in the genomic DNA from which the mate sequence 17b of the sample DNA fragment is derived (genome mapping).
  • The sample determination unit 27 determines whether the sample DNA fragment of the sample sequence 17 includes the mutation indicated by the analysis information 13, based on the success or failure of the mapping of the mate sequence 17b. The sample determination unit 27 then outputs the determination result for the analysis information 13.
  • FIG. 3 is an explanatory diagram of a flow showing creation processing of each dictionary.
  • First, the read dictionary creation process (S101 to S105), in which the read dictionary creation unit 21 creates the read sequence dictionary 14 from the read sequence set 11, will be described.
  • In S101, the sequence data analysis apparatus 1 accepts input of a read sequence set 11 (left sequence 11a and right sequence 11b) from the sequencer 9.
  • In FIG. 4A, two sample DNA fragments 301 and 305 are illustrated for ease of explanation.
  • The left sequence 11a “GA” of the sample DNA fragment 301 has two bases and its right sequence 11b “T” has one base. The left sequence 11a “C” of the sample DNA fragment 305 has one base and its right sequence 11b “TA” has two bases.
  • The sequencer 9 sends the read sequence set 11 to the sequence data analysis apparatus 1 in the FASTQ format, as indicated by reference numeral 361 for the left sequences 11a and reference numeral 362 for the right sequences 11b.
  • Reference numerals 361 and 362 each list the read sequences of the two sample DNA fragments 301 and 305, and four lines of information are described for each read sequence.
  • The first line of a FASTQ record (“@seq1”, “@seq2”) is an identifier (ID) of the sample DNA fragment.
  • The second line (“GA”, “T”, “C”, “TA”) is the read sequence. For example, since the first line “@seq1” of reference numeral 361 matches the first line “@seq1” of reference numeral 362, the two reads are known to be a pair read from the same sample DNA fragment 301.
  • The read dictionary creation unit 21 joins the base character string of the left sequence 11a and the base character string of the right sequence 11b of a pair with a joining character “&” to create a combined character string that represents one sample DNA fragment.
  • The base characters are “A, C, G, T”, indicating the four types of bases, and “N”, indicating an unknown base.
  • In FIG. 4A, the left sequence 11a “GA” and the right sequence 11b “T”, which are a pair from the same sample DNA fragment 301, are joined by the joining character “&” 302, and a terminal character “$” 303 is then appended to obtain a combined character string 304 “GA&T$”.
  • Similarly, a combined character string 306 “C&TA$” is obtained from the pair of the sample DNA fragment 305.
  • Several types of joining characters may be used selectively. For example, “&” may be used as the joining character in combined character strings generated from sample DNA fragments of about 300 bases, and “#” may be used in combined character strings generated from sample DNA fragments of about 20,000 bases. In this way, the joining character gives not only the pair partner within the sample DNA fragment but also the length of the sample DNA fragment.
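  • The pairing step described above can be sketched as follows. This is a minimal illustration, not the apparatus's implementation: the function names `read_fastq` and `combined_strings` are hypothetical, and the sketch assumes plain four-line FASTQ records with both inputs listing the fragments in the same order.

```python
import io


def read_fastq(handle):
    """Yield (fragment_id, sequence) from plain 4-line FASTQ records."""
    while True:
        header = handle.readline().strip()
        if not header:
            return
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line (ignored)
        handle.readline()  # quality line (ignored in this sketch)
        yield header[1:].split()[0], sequence


def combined_strings(left_handle, right_handle, joiner="&", terminal="$"):
    """Join each left/right pair into one string such as 'GA&T$'."""
    result = []
    pairs = zip(read_fastq(left_handle), read_fastq(right_handle))
    for (left_id, left_seq), (right_id, right_seq) in pairs:
        if left_id != right_id:
            raise ValueError(f"ID mismatch: {left_id} vs {right_id}")
        result.append(left_seq + joiner + right_seq + terminal)
    return result


# Toy inputs mirroring the two fragments of FIG. 4A (quality values are made up).
left = io.StringIO("@seq1\nGA\n+\nII\n@seq2\nC\n+\nI\n")
right = io.StringIO("@seq1\nT\n+\nI\n@seq2\nTA\n+\nII\n")
print(combined_strings(left, right))
```

  Once the pair is stored as one string, the fragment ID is needed only while pairing; it is not stored in the combined character string itself.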
  • The read dictionary creation unit 21 performs BW conversion on the combined character strings 304 and 306 to create a BW character string 311.
  • The BW character string 311 is created by the following procedure. In this calculation, character strings are compared character by character from the beginning; the comparison terminates when two terminal characters “$” are compared and continues when two joining characters “&” are compared.
  • (Procedure 1) The combined character string 304 is cyclically shifted to obtain a character string list 307, and the combined character string 306 is likewise cyclically shifted to obtain a character string list 308.
  • (Procedure 2) The merged list 309 is obtained by merging the two lists 307 and 308.
  • (Procedure 3) The sorted list 310 is obtained by sorting the merged list 309 in alphabetical order. The character sort order is, for example, “$ < # < & < A < C < G < T < N”.
  • (Procedure 4) The characters at the end of each line of the sorted list 310 are concatenated to obtain the BW character string 311. Because the BW character string 311 is derived from a sorted list, the same character tends to repeat in runs. The data amount can therefore be reduced by run-length compressing the BW character string 311.
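  • The four procedures can be sketched naively as below. This is an illustrative sketch only: it materializes every cyclic shift, which is far too slow for real read sets, and the tie-break between rows that are equal up to “$” (original string order) is an assumption, as are the function names. The output for the toy input is what this sketch computes and is not claimed to reproduce the figure.

```python
from itertools import groupby

# Sort order taken from the text: $ < # < & < A < C < G < T < N
ORDER = {ch: rank for rank, ch in enumerate("$#&ACGTN")}


def bw_transform(combined):
    """Naive BW conversion of several '$'-terminated combined strings."""
    rows = []
    for index, text in enumerate(combined):
        for shift in range(len(text)):
            rotation = text[shift:] + text[:shift]
            # Comparison stops at '$', so only the prefix up to '$' is the key.
            prefix = rotation[: rotation.index("$") + 1]
            key = [ORDER[ch] for ch in prefix]
            rows.append((key, index, rotation[-1]))
    rows.sort(key=lambda row: (row[0], row[1]))
    return "".join(last for _, _, last in rows)


def run_length(text):
    """Run-length encode a string as (character, count) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(text)]


bw = bw_transform(["GA&T$", "C&TA$"])
print(bw, run_length(bw))
```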
  • The read dictionary creation unit 21 creates the read sequence dictionary 14, which allows searches to be performed efficiently, by converting the BW character string 311 into the Wavelet Tree format.
  • FIG. 5A shows a binary tree 320 referred to for conversion to the Wavelet Tree format.
  • all characters ($, &, A, C, G, T, N) 321 used in the character string are the root.
  • the binary tree 320 recursively classifies all characters ($, &, A, C, G, T, N) 321 used in the character string, and includes at most two kinds of characters at the end of the classification. It is a binary tree showing how to classify as follows.
  • a Wavelet Tree 340 in FIG. 5B is a binary tree showing the BW character string 311.
  • the root of the Wavelet Tree 340 is obtained by converting the BW character string 311 into the binary character string 341 according to the two classifications of W and S of the binary tree 320.
  • The read dictionary creation unit 21 creates a partial character string 342 extracted from the characters classified as W and a partial character string 343 extracted from the characters classified as S.
  • The read dictionary creation unit 21 converts the partial character string 342 (which includes only the two characters A and T) into a binary character string 344 of 0s and 1s.
  • The read dictionary creation unit 21 creates the binary character string 345 according to the further classification (indicated by the binary tree 320) of the characters classified as S into M and K.
  • The read dictionary creation unit 21 creates a partial character string 346 extracted from the characters classified as M and a partial character string 347 extracted from the characters classified as K.
  • a character string 350 composed of a symbol & (302) representing a pair and a terminal symbol $ (303) is represented by a binary character string 351.
  • the length (number of bits) of the binary character string 351 is equal to the total number of reads (twice the total number of pairs). Therefore, the Wavelet Tree 340 is obtained by reversibly converting the BW character string 311, and the BW character string 311 can be restored from the Wavelet Tree 340.
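  • The idea of the Wavelet Tree conversion can be sketched as follows. Note the assumptions: this sketch splits the alphabet in half at every node instead of using the W/S and M/K classification of the binary tree 320, it recurses down to single-character leaves, and it counts bits naively rather than with auxiliary data. The class name `WaveletNode` is hypothetical. The `access` method shows that the original string can be restored, and `rank` counts occurrences in positions 0 to p.

```python
class WaveletNode:
    """Wavelet tree node over an ordered alphabet (naive bit counting)."""

    def __init__(self, text, alphabet):
        self.alphabet = list(alphabet)
        self.bits = None
        self.left = self.right = None
        if len(self.alphabet) > 1:
            half = len(self.alphabet) // 2
            self.low = set(self.alphabet[:half])
            # 0 = character belongs to the lower half, 1 = upper half
            self.bits = [0 if ch in self.low else 1 for ch in text]
            self.left = WaveletNode(
                [ch for ch in text if ch in self.low], self.alphabet[:half])
            self.right = WaveletNode(
                [ch for ch in text if ch not in self.low], self.alphabet[half:])

    def _descend(self, bit, upto):
        """Count bits equal to `bit` in positions 0..upto and pick the child."""
        count = sum(1 for b in self.bits[: upto + 1] if b == bit)
        return (self.left if bit == 0 else self.right), count

    def access(self, i):
        """Return the character at position i of the original string."""
        if self.bits is None:
            return self.alphabet[0]
        child, count = self._descend(self.bits[i], i)
        return child.access(count - 1)

    def rank(self, p, ch):
        """Number of occurrences of ch in positions 0..p (inclusive)."""
        if ch not in self.alphabet:
            raise ValueError(f"{ch!r} is not in the alphabet")
        if p < 0:
            return 0
        if self.bits is None:
            return p + 1
        child, count = self._descend(0 if ch in self.low else 1, p)
        return child.rank(count - 1, ch)


bw = "TAACTG$$&&"
tree = WaveletNode(bw, "$&ACGTN")
print("".join(tree.access(i) for i in range(len(bw))))
```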
  • The read dictionary creation unit 21 outputs the read sequence dictionary 14 created from the Wavelet Tree 340 to the storage unit.
  • The read dictionary creation unit 21 may add to the read sequence dictionary 14 a small amount of auxiliary data (about 3.5% of the BW character string 311) for calculating the rank function and the select function of the Wavelet Tree 340.
  • rank (p, c) is a function that returns the number of appearances of the character “c” in the array elements 0 to p.
  • select (i, c) is a function that returns the array position where the (i + 1) th character “c” appears.
  • the auxiliary data is, for example, “hierarchical binary string” described in the reference “Kouichi Kimura, Yutaka Suzuki, Sumio Sugano, and Asako Koike. Journal of Computational Biology. November 2009, 16 (11): 1601-1613.” .
  • This auxiliary data is data for efficiently performing a search such as obtaining all read sequence fragments that match a given base sequence from the BW character string 311.
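  • The two definitions above can be stated directly in code. The following is a naive reference sketch that scans the string on every call; it only pins down the semantics (inclusive position p, zero-based occurrence index i) and does not use auxiliary data of any kind. Passing the string as the first argument is a choice made for this sketch.

```python
def rank(text, p, ch):
    """Number of occurrences of ch in positions 0..p (inclusive) of text."""
    return text[: p + 1].count(ch)


def select(text, i, ch):
    """Position at which the (i + 1)-th occurrence of ch appears in text."""
    position = -1
    for _ in range(i + 1):
        position = text.find(ch, position + 1)
        if position == -1:
            raise ValueError(f"{ch!r} occurs fewer than {i + 1} times")
    return position


bw = "TAACTG$$&&"
print(rank(bw, 4, "T"), select(bw, 1, "T"))
```

  The two functions are inverse to each other in the sense that `rank(text, select(text, i, ch), ch)` equals `i + 1`.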
  • The read dictionary creation process (S101 to S105) has been described above with reference to FIGS.
  • The genome dictionary creation process (S101b to S105b in FIG. 3), on the other hand, can be performed in the same way as the read dictionary creation process (S101 to S105).
  • the genome dictionary creation unit 22 receives an input of the reference genome sequence 12 instead of the read sequence set 11 in S101.
  • the genome dictionary creation unit 22 creates a single character string by linking the chromosome sequences (base character strings) of a plurality of genomic DNAs indicated by the reference genome sequence 12 with the terminal character “$” as they are.
  • The pair joining process with the joining character “&”, as in S102, is unnecessary.
  • the genome dictionary creation unit 22 outputs the genome sequence dictionary 15 instead of the read sequence dictionary 14 in S105.
  • FIG. 6 is an explanatory diagram of the flow showing the SNP analysis processing.
  • the sequence data analyzing apparatus 1 accepts input of the SNP information 13a as the analysis information 13 indicating the mutation to be analyzed this time.
  • FIG. 7A illustrates a table 400 indicating the SNP information 13a. As shown in each row of the table 400, the information for each SNP includes the chromosome name, the base position coordinate on the chromosome, the type of base in the reference genome sequence (standard base), and the type of base that appears as the SNP (mutant base).
  • the first line of the table 400 indicates that the SNP is located at the 123456th base position on chromosome 7 and the base “A” of the reference genome is mutated to the base “G”.
  • the query creation unit 23 creates the query sequence 16 including the SNP indicated by the SNP information 13a with reference to the genome sequence dictionary 15.
  • the explanation column 420 in FIG. 7B is an example of S122, and the horizontal axis 421 is the base position coordinates on the chromosome.
  • the query creation unit 23 refers to the genome sequence dictionary 15 to obtain a base sequence 422 around the SNP position 424 (for example, about 10 bases on the left and right) indicated in the first row of the table 400.
  • the query creation unit 23 creates a sequence 423 obtained by mutating the base of the base sequence 422 at the SNP position 424, and uses this as the query sequence 16.
  • By using the base sequence 422, which does not contain the mutation, as the query sequence 16 instead of the sequence 423, which contains the mutation, the query creation unit 23 can also detect the standard base appearing at the SNP position 424.
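  • The query creation for one SNP row can be sketched as follows. This is a hedged illustration: the function name `snp_queries` is hypothetical, the chromosome is held as a plain string rather than looked up through the genome sequence dictionary 15, the position is assumed to be 1-based, and the default flank of 10 bases follows the "about 10 bases" example in the text.

```python
def snp_queries(chromosome, position, standard_base, mutant_base, flank=10):
    """Return (standard_query, mutant_query) around a 1-based SNP position."""
    index = position - 1
    if chromosome[index] != standard_base:
        raise ValueError("standard base does not match the reference")
    start = max(0, index - flank)
    end = index + flank + 1
    standard_query = chromosome[start:end]
    # Same window, with only the base at the SNP position replaced.
    mutant_query = chromosome[start:index] + mutant_base + chromosome[index + 1:end]
    return standard_query, mutant_query


print(snp_queries("CCCCACCCC", 5, "A", "G", flank=2))
```

  Searching with both queries gives the counts of reads supporting the standard base and the mutant base at the same position.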
  • The query search unit 24 searches the read sequence dictionary 14 for a hit position 17a, which is a base position coordinate at which the query sequence 16 is mapped (where the base character strings match).
  • When character strings are compared in this search, the comparison is terminated if terminal characters “$” are compared and continued if joining characters “&” are compared.
  • the sequence data analyzing apparatus 1 executes a loop process for each hit position 17a obtained in S123.
  • The sample restoration unit 25 restores the read sequence (sample sequence 17) of the sample DNA fragment that includes the hit position 17a currently selected in the loop from S131.
  • In this restoration process, the rank function and the select function are applied to the read sequence dictionary 14 created by BW conversion, and the scan is extended within the read sequence dictionary 14 to the left and right from the hit position 17a until a terminal character “$” appears. The sample sequence 17 sandwiched between the left and right terminal characters “$” is thereby acquired.
  • a method using such a rank function and a select function is described in, for example, the document “Ferragina, P.
  • The sample sequence 17 includes a joining character “&” and a terminal character “$”, as in the combined character strings 304 and 306.
  • By splitting the sample sequence 17 at the joining character “&”, the two read sequences forming a pair (left sequence 11a and right sequence 11b) can be obtained without using an identifier for each sample DNA fragment.
  • The sample restoration unit 25 scans (inspects) the sample sequence 17 for the joining character “&” and obtains the mate sequence 17b as follows (S134).
  • When the joining character “&” appears to the right of the sequence 423 at the hit position 17a, the acquired mate sequence 17b is the right sequence 11b on the right side of the joining character “&”.
  • When the joining character “&” appears to the left of the sequence 423 at the hit position 17a, the acquired mate sequence 17b is the left sequence 11a on the left side of the joining character “&”.
  • When no joining character “&” appears, the sample sequence 17 is a single read sequence with no partner forming a pair. Such a single read sequence may be regarded as having low reliability and ignored. Alternatively, the mutation may be determined to have been detected (although with low reliability) only when a query of the genome sequence dictionary 15 with the sequence 422 before the mutation is introduced confirms that the sequence appears at only one place in the genome.
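  • The restoration and mate extraction can be sketched on plain text as follows. Important assumption: the real dictionary is a BW-converted Wavelet Tree scanned with rank and select, whereas this sketch uses the concatenation of the combined character strings as a stand-in so that only the "$"/"&" logic is shown. The function names are hypothetical.

```python
def find_hits(text, query):
    """All start positions at which query appears in text."""
    hits, position = [], text.find(query)
    while position != -1:
        hits.append(position)
        position = text.find(query, position + 1)
    return hits


def restore_sample(text, hit):
    """Extend left and right from hit up to the nearest '$' characters."""
    start = text.rfind("$", 0, hit) + 1  # 0 when no '$' lies to the left
    end = text.find("$", hit)            # terminal character of this sample
    return start, text[start:end]


def mate_of(text, hit):
    """Return ('right'|'left', mate sequence), or None for a single read."""
    start, sample = restore_sample(text, hit)
    if "&" not in sample:
        return None
    left, right = sample.split("&", 1)
    joiner_position = start + len(left)
    # '&' to the right of the hit: the hit is in the left sequence,
    # so the mate is the right sequence, and vice versa.
    return ("right", right) if hit < joiner_position else ("left", left)


text = "GA&T$C&TA$"
for hit in find_hits(text, "TA"):
    print(hit, restore_sample(text, hit)[1], mate_of(text, hit))
```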
  • The mapping unit 26 refers to the genome sequence dictionary 15 to identify the position in the genomic DNA from which the mate sequence 17b acquired in S134 is derived (genome mapping). For example, the mapping unit 26 cuts out a short partial sequence (for example, about 20 bases) from the mate sequence 17b and queries the genome sequence dictionary 15 as to whether the partial sequence appears at only one place in the genome. If it appears nowhere, the partial sequence is considered to contain a sequencing error or a polymorphism, so the mapping unit 26 queries again with another partial sequence. If it appears at several places, the mapping unit 26 lengthens the partial sequence or queries again with another partial sequence.
  • mapping is successful.
  • mapping fails. If Yes in S135, the process proceeds to S136. If No, the current loop is terminated (S139), and the process returns to S131 to select the next hit position.
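  • The seed-based mapping strategy described above can be sketched as follows. This is a simplified illustration under stated assumptions: the genome is a plain string instead of the genome sequence dictionary 15, strand orientation is ignored, and the retry policy (slide the seed one base at a time, lengthen by one base when ambiguous) is one possible reading of "another partial sequence". The names `find_all` and `map_mate` are hypothetical.

```python
def find_all(genome, seed):
    """All start positions at which seed appears in genome."""
    hits, position = [], genome.find(seed)
    while position != -1:
        hits.append(position)
        position = genome.find(seed, position + 1)
    return hits


def map_mate(genome, mate, seed_length=20):
    """Return the inferred genome start of mate, or None if mapping fails."""
    for offset in range(len(mate) - seed_length + 1):
        length = seed_length
        while offset + length <= len(mate):
            hits = find_all(genome, mate[offset:offset + length])
            if len(hits) == 1:
                return hits[0] - offset  # unique: mapping succeeds
            if not hits:
                break                    # no hit: try another partial sequence
            length += 1                  # several hits: lengthen the seed
    return None


genome = "ACGTACGTTTGACCA"
print(map_mate(genome, "TTGAC", seed_length=3))
print(map_mate(genome, "ACGTT", seed_length=3))
```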
  • The sample determination unit 27 determines whether the sample DNA fragment of the sample sequence 17 restored in S132 includes the SNP mutation indicated by the analysis information 13, based on whether the distance is normal (consistent), as follows. For example, if the distance between the mapping position of the successfully mapped mate sequence 17b and the SNP position (hit position 17a) indicated by the SNP information 13a is substantially equal to the length of the sample DNA fragment, the sample determination unit 27 regards the left sequence 11a and the right sequence 11b constituting the sample sequence 17 as consistent and determines “SNP detection” (that is, agreement with the SNP information 13a). If Yes in S136, the process proceeds to S137; if No, the process proceeds to S138.
  • In S137, the sample determination unit 27 increments by one the detection number counter value of the SNP information 13a determined to be “SNP detection”. In S138, the sample determination unit 27 does not increase the detection number counter value because the determination is “SNP non-detection”.
  • The sample determination unit 27 outputs (reports to the user), as the determination result for the analysis information 13, information that associates each piece of SNP information 13a with its detection number counter value (the number of sample DNA fragments in which the SNP was detected).
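  • The distance check and the counting can be sketched as follows. The tolerance that defines "substantially equal" is not given in the text, so the 20% default here is purely an assumption, as are the function names and the tuple layout of the observations.

```python
from collections import Counter


def distance_is_normal(snp_position, mate_position, fragment_length, tolerance=0.2):
    """True if the SNP-to-mate distance is close to the fragment length."""
    distance = abs(mate_position - snp_position)
    return abs(distance - fragment_length) <= tolerance * fragment_length


def count_detections(observations, fragment_length):
    """Count, per SNP, the fragments whose mate maps at a normal distance.

    observations: iterable of (snp_id, snp_position, mate_position or None),
    where None means that mapping of the mate sequence failed.
    """
    counts = Counter()
    for snp_id, snp_position, mate_position in observations:
        if mate_position is None:
            continue  # mapping failed: not counted
        if distance_is_normal(snp_position, mate_position, fragment_length):
            counts[snp_id] += 1
    return counts


observations = [
    ("chr7:123456", 123456, 123740),  # distance 284, normal for ~300 bases
    ("chr7:123456", 123456, 900000),  # far away: not counted
    ("chr7:123456", 123456, None),    # mate not mapped
]
print(count_detections(observations, fragment_length=300))
```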
  • the table 460 in FIG. 7D is an example of information output in S141.
  • For each SNP shown in the table 400 of FIG. 7A, the table 460 lists the number of read fragments in which the mutant base (SNP) was detected at the SNP position 424 (mutant base detection number) and the number of read fragments in which the standard base was detected (standard base detection number), both read from the detection number counter values.
  • FIG. 8 is an explanatory diagram of the flow showing the structural mutation analysis processing. Focusing on the difference from the SNP analysis processing of FIG. 6, FIG. 8 will be described below.
  • the sequence data analyzing apparatus 1 accepts input of structural variation information 13b instead of the SNP information 13a as analysis information 13 indicating the mutation to be analyzed this time.
  • the structural mutation information 13b in FIG. 9A includes, for each mutation, the name of the chromosome, the base position coordinates on the chromosome, the type of mutation (insertion or deletion), and the mutation length. Contains information.
  • the first line of the table 600 indicates that a structural mutation is located at the position of the 654321 base on chromosome 3 and that a deletion in which a sequence of 500 consecutive bases is lost from the standard genome occurs.
  • the query creation unit 23 creates the query sequence 16 around the structural mutation indicated by the structural mutation information 13b with reference to the genome sequence dictionary 15.
  • An explanation column 620 in FIG. 9B shows a method of creating the query sequence 16 in S122b.
  • the horizontal axis 421 is the base position coordinates on the chromosome.
  • The query creation unit 23 obtains short base sequences 622 and 623 around the position 624 where the structural variation occurs (for example, at positions several tens of bases to its left and right) and uses each of them as a query sequence 16. That is, each query sequence 16 is a short sequence near (to the left or right of) the position 624 where the structural variation occurs.
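  • A minimal sketch of cutting out the two flanking query sequences is given below. The offset of 30 bases and the query length of 10 bases are illustrative assumptions standing in for "several tens of bases" and "short", the position is assumed to be 1-based, and the function name `flank_queries` is hypothetical.

```python
def flank_queries(chromosome, position, offset=30, length=10):
    """Return (left_query, right_query) on either side of a 1-based position."""
    index = position - 1
    left_end = index - offset       # left query ends `offset` bases before
    right_start = index + offset    # right query starts `offset` bases after
    if left_end - length < 0 or right_start + length > len(chromosome):
        raise ValueError("position is too close to the chromosome end")
    left_query = chromosome[left_end - length:left_end]
    right_query = chromosome[right_start:right_start + length]
    return left_query, right_query


# Synthetic 100-base chromosome used only for demonstration.
chromosome = "".join("ACGT"[i % 4] for i in range(100))
print(flank_queries(chromosome, 51))
```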
  • In S133b of FIG. 8, after the processing of S133 (not shown in FIG. 8) has been executed, it is determined whether the hit is outside the scope of structural mutation analysis (that is, mutation information cannot be obtained from it). If Yes in S133b, the current loop iteration ends (S139); if No, the process proceeds to S134.
  • the determination process of S133b will be described with reference to the description column 640 of FIG.
  • First, the sample sequence 17 is restored rightward from the query sequence 622, which lies to the left of the position 624, starting at its hit position 17a, until the connecting character “&” is restored.
  • In this case, the query sequence 622 is contained in the left sequence 11a. Therefore, the sequence 641 to the right of the connecting character “&” (the right sequence 11b) is restored as the mate sequence 17b.
  • The position 624 where the structural variation occurs may then lie between the query sequence 622 and the sequence 641.
  • Although not illustrated, when the connecting character “&” appears in the restored character string extended to the left of the query sequence 622, no structural variation can be detected (mutation information cannot be obtained).
  • For the query sequence 623, which lies to the right of the position 624, the same determination is performed from its hit position 17a with left and right reversed relative to [1] in FIG. 9C.
  • In this case, the query sequence 623 is contained in the right sequence 11b. Therefore, the sequence 642 to the left of the connecting character “&” (the left sequence 11a) is restored as the mate sequence 17b.
  • The position 624 where the structural variation occurs may then lie between the query sequence 623 and the sequence 642.
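The determination above can be illustrated on an uncompressed combined character string. In the apparatus the characters are restored through the read sequence dictionary 14; the sketch below scans a plain Python string instead, which is a simplification. The function name, the `side` parameter, and the use of "$" as the character separating pairs are assumptions made for illustration.

```python
CONNECT = "&"    # connecting character between the left and right sequences
TERMINATE = "$"  # assumed character separating one pair from the next


def restore_mate(combined, hit_start, hit_len, side):
    """Return the mate of the read containing the hit, or None.

    side="left":  query lies left of the variation; extend rightward.
    side="right": query lies right of the variation; extend leftward.
    None means the hit is not subject to structural mutation analysis.
    """
    if side == "left":
        i = hit_start + hit_len
        while i < len(combined) and combined[i] not in (CONNECT, TERMINATE):
            i += 1
        if i == len(combined) or combined[i] == TERMINATE:
            return None  # the query sits in a right sequence
        end = combined.find(TERMINATE, i + 1)
        if end == -1:
            end = len(combined)
        return combined[i + 1:end]  # right sequence is the mate
    if side == "right":
        i = hit_start - 1
        while i >= 0 and combined[i] not in (CONNECT, TERMINATE):
            i -= 1
        if i < 0 or combined[i] == TERMINATE:
            return None  # the query sits in a left sequence
        start = combined.rfind(TERMINATE, 0, i) + 1  # 0 when no earlier pair
        return combined[start:i]  # left sequence is the mate
    raise ValueError("side must be 'left' or 'right'")


combined = "ACGTAC&TTGGCC$GGGAAA&CCCTTT$"
mate_right = restore_mate(combined, combined.find("CGT"), 3, "left")
mate_left = restore_mate(combined, combined.find("CCT"), 3, "right")
excluded = restore_mate(combined, combined.find("TGG"), 3, "left")
```

Here the hit on "CGT" lies in a left sequence, so its mate is the right sequence after "&"; the hit on "TGG" lies in a right sequence, so a left-side query yields no mate and the hit is excluded.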
  • The sample determination unit 27 evaluates whether the distance between the mate sequence 17b and its paired read sequence in the source sample sequence 17 is a normal value, that is, substantially equal to the length of the sample DNA fragment (S136b). If No in S136b, a structural mutation has been detected, so the sample determination unit 27 increments the “number of detections with mutation” counter in the corresponding structural mutation information 13b by one (S137b). Here, the sample determination unit 27 may count detections separately for the “deletion” and “insertion” mutation types.
  • If Yes in S136b, the sample determination unit 27 increments the “number of detections without mutation” counter in the corresponding structural mutation information 13b by one (S138b).
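The branch at S136b and the two counters can be sketched as follows. The text says only that the distance should be substantially equal to the fragment length, so the relative `tolerance` used here is an assumed threshold, and the function and counter names are hypothetical.

```python
from collections import Counter


def is_normal_distance(distance, fragment_length, tolerance=0.1):
    """S136b: is the pair distance close to the fragment length?

    `tolerance` is an assumed relative margin, not a value from the text.
    """
    return abs(distance - fragment_length) <= tolerance * fragment_length


def count_detections(distances, fragment_length, tolerance=0.1):
    """Tally read pairs with and without evidence of a structural mutation."""
    counts = Counter(with_mutation=0, without_mutation=0)
    for distance in distances:
        if is_normal_distance(distance, fragment_length, tolerance):
            counts["without_mutation"] += 1  # S138b
        else:
            counts["with_mutation"] += 1     # S137b
    return counts


# Toy distances for one structural mutation, with a 500-base fragment length.
counts = count_detections([500, 510, 1000, 480, 30], fragment_length=500)
```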
  • The sequence data analysis apparatus 1 outputs information such as the table 660 of FIG. 9D to report the counter values of S137b and S138b as the structural mutation detection results.
  • For each structural variation shown in the table 600, the table 660 reports the number of read fragments determined to carry the mutation and the number determined to be without it (that is, consistent with the standard genome), as the number of detections with mutation and the number of detections without mutation, respectively.
  • The read dictionary creation unit 21 joins the left sequence 11a and the right sequence 11b of each sample DNA fragment pair with a connecting character, and joins successive pairs with a terminating character, to create a combined character string. The read dictionary creation unit 21 then creates the read sequence dictionary 14 by applying the Burrows-Wheeler (BW) transform to the combined character string and storing the result as a wavelet tree.
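A toy version of the dictionary construction is sketched below: it builds the combined character string and applies a naive Burrows-Wheeler transform by sorting rotations. This is a teaching sketch under stated assumptions. The characters "&" and "$", the function names, and the extra end-of-text sentinel are illustrative choices; a practical implementation would use a suffix array, and the wavelet tree mentioned in the text is not built here.

```python
def build_combined_string(pairs, connect="&", terminate="$"):
    """Join each (left, right) read pair with `connect`; end each pair with `terminate`."""
    return "".join(left + connect + right + terminate for left, right in pairs)


def bwt(text, sentinel="\0"):
    """Naive Burrows-Wheeler transform: last column of the sorted rotations.

    A unique end-of-text `sentinel` (assumed absent from `text`) is appended
    so that the transform is well defined. Quadratic memory; toy inputs only.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


pairs = [("ACGT", "TTGA"), ("GGCC", "AATT")]
combined = build_combined_string(pairs)
transformed = bwt(combined)
```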
  • When the sample restoration unit 25 restores, from the hit position 17a of the query sequence 16 in the read sequence dictionary 14, the pair of sample DNA fragment reads containing that hit position (including the mate sequence 17b), it can restore the mate sequence 17b by using the connecting character embedded in the read sequence dictionary 14 as a clue. Because no identifier is needed for each sample DNA fragment, the mate sequence 17b paired with an arbitrary read sequence can be computed efficiently.
  • This reduces the amount of data and processing that the sequence data analysis apparatus 1 must handle.
  • By contrast, suppose an identifier is used for each sample DNA fragment, like “@seq1” at reference numeral 361 in FIG. 4. If the read sequence set 11 read from the sample DNA fragments contains 1 billion reads, counting the left sequences 11a and the right sequences 11b together, 500 million distinct identifiers are required. At 4 bytes per identifier, the identifier data as a whole requires 4 gigabytes. In addition, using an identifier for each sample DNA fragment incurs the cost of searching for the paired mate sequence 17b with the identifier of the hit read sequence as the search key.
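The identifier figures can be checked with a short calculation. The 4-gigabyte total matches one 4-byte identifier stored with each of the 1 billion reads; that storage layout is an assumption here, since the text gives only the totals. Storing each distinct identifier once would give half that amount. Gigabytes are taken as 10^9 bytes, and the variable names are illustrative.

```python
READS_TOTAL = 1_000_000_000  # left and right sequences counted together
BYTES_PER_IDENTIFIER = 4

# Each sample DNA fragment yields one left read and one right read,
# so the number of distinct fragment identifiers is half the read count.
distinct_identifiers = READS_TOTAL // 2

# Assumption: every read carries its own copy of the fragment identifier.
bytes_stored_per_read = READS_TOTAL * BYTES_PER_IDENTIFIER

# Alternative for comparison: each distinct identifier stored once.
bytes_stored_per_pair = distinct_identifiers * BYTES_PER_IDENTIFIER

gigabytes_per_read = bytes_stored_per_read / 10**9
gigabytes_per_pair = bytes_stored_per_pair / 10**9
```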
  • The present invention is not limited to the embodiments described above and includes various modifications.
  • The embodiments above are described in detail to make the present invention easy to understand, and the invention is not necessarily limited to configurations having all of the described elements.
  • Part of the configuration of one embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to the configuration of one embodiment.
  • Some or all of the configurations, functions, processing units, processing means, and the like described above may be implemented in hardware, for example by designing them as an integrated circuit.
  • The configurations, functions, and the like described above may also be implemented in software, with a processor interpreting and executing a program that realizes each function.
  • Information such as programs, tables, and files for realizing each function can be stored in a memory, a hard disk, a recording device such as an SSD (Solid State Drive), or a recording medium such as an IC (Integrated Circuit) card, an SD card, or a DVD (Digital Versatile Disc).
  • The control lines and information lines shown are those considered necessary for the explanation; not all control lines and information lines of a product are necessarily shown. In practice, almost all components may be considered to be connected to one another.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Chemical & Material Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
PCT/JP2015/057348 2014-04-03 2015-03-12 配列データ解析装置、dna解析システムおよび配列データ解析方法 Ceased WO2015151758A1 (ja)

Priority Applications (4)

Application Number Priority Date Filing Date Title
US15/301,086 US10810239B2 (en) 2014-04-03 2015-03-12 Sequence data analyzer, DNA analysis system and sequence data analysis method
GB1616668.8A GB2539596B (en) 2014-04-03 2015-03-12 Sequence data analyzer, DNA analysis system and sequence data analysis method
CN201580014840.6A CN106104541B (zh) 2014-04-03 2015-03-12 序列数据分析装置、dna分析系统以及序列数据分析方法
DE112015001637.6T DE112015001637T5 (de) 2014-04-03 2015-03-12 Sequenzdatenanalysator, DNA-Analysesystem und Sequenzdatenanalyseverfahren

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2014077278A JP6198659B2 (ja) 2014-04-03 2014-04-03 配列データ解析装置、dna解析システムおよび配列データ解析方法
JP2014-077278 2014-04-03

Publications (1)

Publication Number Publication Date
WO2015151758A1 true WO2015151758A1 (ja) 2015-10-08

Family

ID=54240090

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2015/057348 Ceased WO2015151758A1 (ja) 2014-04-03 2015-03-12 配列データ解析装置、dna解析システムおよび配列データ解析方法

Country Status (6)

Country Link
US (1) US10810239B2
JP (1) JP6198659B2
CN (1) CN106104541B
DE (1) DE112015001637T5
GB (1) GB2539596B
WO (1) WO2015151758A1

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017224191A (ja) * 2016-06-16 2017-12-21 株式会社日立製作所 Dna配列解析装置、dna配列解析方法及びdna配列解析システム
CN111782609A (zh) * 2020-05-22 2020-10-16 北京和瑞精准医学检验实验室有限公司 一种快速将fastq文件均匀分片的方法

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10395759B2 (en) 2015-05-18 2019-08-27 Regeneron Pharmaceuticals, Inc. Methods and systems for copy number variant detection
US12071669B2 (en) 2016-02-12 2024-08-27 Regeneron Pharmaceuticals, Inc. Methods and systems for detection of abnormal karyotypes
CN106446537B (zh) * 2016-09-18 2018-07-10 北京百度网讯科技有限公司 结构体变异检测的方法、设备及系统
US20220199199A1 (en) * 2019-02-07 2022-06-23 Biokey Bv Biological sequence information handling
WO2021134574A1 (zh) * 2019-12-31 2021-07-08 深圳华大智造科技有限公司 创建基因突变词典及利用基因突变词典压缩基因组数据的方法和装置
WO2022054178A1 (ja) * 2020-09-09 2022-03-17 株式会社日立ハイテク 個体ゲノムの構造変異検出方法及び装置
KR102265937B1 (ko) * 2020-12-21 2021-06-17 주식회사 모비젠 시퀀스데이터의 분석 방법 및 그 장치
CN114550828B (zh) * 2022-01-28 2024-12-10 赛纳生物科技(北京)有限公司 一种基因序列的比对方法

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008547080A (ja) * 2005-06-14 2008-12-25 エイジェンシー・フォー・サイエンス,テクノロジー・アンド・リサーチ ダイタグ配列の処理および/またはゲノムマッピングの方法
JP2009116559A (ja) * 2007-11-06 2009-05-28 Hitachi Ltd 大量配列の一括検索方法及び検索システム
JP2014502513A (ja) * 2011-01-14 2014-02-03 キージーン・エン・フェー ペアエンドランダムシーケンスに基づく遺伝子型解析

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AUPS115502A0 (en) * 2002-03-18 2002-04-18 Diatech Pty Ltd Assessing data sets
JP5985040B2 (ja) 2013-02-28 2016-09-06 株式会社日立ハイテクノロジーズ データ解析装置、及びその方法

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HENG LI ET AL.: "Fast and accurate short read alignment with Burrows-Wheeler transform", BIOINFORMATICS, vol. 25, no. 14, 15 July 2009 (2009-07-15), pages 1754 - 1760, XP055123443 *
HENG LI ET AL.: "The Sequence Alignment/Map format and SAMtools", BIOINFORMATICS, vol. 25, no. 16, pages 2078 - 2079, XP055229864 *
KOICHI KIMURA ET AL.: "LOCALIZED SUFFIX ARRAY AND ITS APPLICATIONTO GENOME MAPPING PROBLEMS FOR PAIRED-ENDSHORT READS", GENOME INFORMATICS, vol. 23, no. 1, October 2009 (2009-10-01), pages 60 - 71 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017224191A (ja) * 2016-06-16 2017-12-21 株式会社日立製作所 Dna配列解析装置、dna配列解析方法及びdna配列解析システム
CN111782609A (zh) * 2020-05-22 2020-10-16 北京和瑞精准医学检验实验室有限公司 一种快速将fastq文件均匀分片的方法
CN111782609B (zh) * 2020-05-22 2023-10-13 北京和瑞精湛医学检验实验室有限公司 一种快速将fastq文件均匀分片的方法

Also Published As

Publication number Publication date
DE112015001637T5 (de) 2017-02-09
GB2539596B (en) 2021-03-17
JP6198659B2 (ja) 2017-09-20
GB201616668D0 (en) 2016-11-16
GB2539596A (en) 2016-12-21
CN106104541A (zh) 2016-11-09
JP2015197899A (ja) 2015-11-09
US10810239B2 (en) 2020-10-20
US20170017717A1 (en) 2017-01-19
CN106104541B (zh) 2018-09-11

Similar Documents

Publication Publication Date Title
JP6198659B2 (ja) 配列データ解析装置、dna解析システムおよび配列データ解析方法
US11702708B2 (en) Systems and methods for analyzing viral nucleic acids
US20200411138A1 (en) Compressing, storing and searching sequence data
AU2015298543B2 (en) Methods and systems for data analysis and compression
US20130166518A1 (en) Compression Of Genomic Data File
Xie et al. GeneMiner: A tool for extracting phylogenetic markers from next‐generation sequencing data
JP5985040B2 (ja) データ解析装置、及びその方法
JP2019537172A (ja) バイオインフォマティクスデータのインデックスを付けるための方法及びシステム
US20130117246A1 (en) Methods of processing text data
Bernardes et al. A multi-objective optimization approach accurately resolves protein domain architectures
WO2010042888A1 (en) A computational method for comparing, classifying, indexing, and cataloging of electronically stored linear information
CN111339293B (zh) 告警事件的数据处理方法、装置和告警事件的分类方法
Chappell et al. K-means clustering of biological sequences
CN110782946A (zh) 识别重复序列的方法及装置、存储介质、电子设备
Ju et al. Fleximer: accurate quantification of RNA-Seq via variable-length k-mers
Vaddadi et al. Read mapping on genome variation graphs
US11515011B2 (en) K-mer based genomic reference data compression
CN113611358A (zh) 样品病原细菌分型方法和系统
Esmat et al. A parallel hash‐based method for local sequence alignment
CN117932518A (zh) 一种地热异常探测方法、装置、电子设备及存储介质
JP2011024473A (ja) アプタマー分類装置、アプタマー分類方法、プログラムおよび記録媒体
EP3418927B1 (en) Method and device for processing dna sequence
CN119226579A (zh) 生成用于k个不匹配搜索的过滤器的系统和方法
WO2022054178A1 (ja) 個体ゲノムの構造変異検出方法及び装置
Liu et al. Fastqzip: an improved reference-based genome sequence lossy compression framework

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15773886

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 201616668

Country of ref document: GB

Kind code of ref document: A

Free format text: PCT FILING DATE = 20150312

WWE Wipo information: entry into national phase

Ref document number: 1616668.8

Country of ref document: GB

Ref document number: 15301086

Country of ref document: US

WWE Wipo information: entry into national phase

Ref document number: 112015001637

Country of ref document: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15773886

Country of ref document: EP

Kind code of ref document: A1