WO2016143062A1

WO2016143062A1 - Sequence data analyzer, dna analysis system and sequence data analysis method

Info

Publication number: WO2016143062A1
Application number: PCT/JP2015/056964
Authority: WO
Inventors: 宏一木村
Original assignee: 株式会社日立ハイテクノロジーズ
Priority date: 2015-03-10
Filing date: 2015-03-10
Publication date: 2016-09-15

Abstract

In the present invention, a sample DNA is analyzed by mapping read sequences, said read sequences being obtained from sequencing data, on two-dimensional genome reference coordinates wherein genome reference coordinates are two-dimensionally arranged.

Description

Sequence data analysis apparatus, DNA analysis system, and sequence data analysis method

The present invention relates to a sequence data analysis apparatus, a DNA analysis system, and a sequence data analysis method.

The base sequence of genomic DNA (deoxyribonucleic acid) has already been sequenced (read) over its entire length, and the resulting base character string is publicly available on servers on the Internet. Using this genomic DNA as a model (reference data), a researcher detects mutations in sample DNA fragments by collating the base positions of the subject's sample DNA fragments, read by a sequencer device, against the reference (genome mapping). A mutation is a difference between the base character string of the genomic DNA and that of a sample DNA fragment, for example a single nucleotide polymorphism (SNP) or a structural mutation (SA). The sample DNA fragments are obtained by fragmenting one sample DNA and replicating the fragments in large numbers through fragment processing in the sequencer apparatus.

In many cases, the DNA of cells in the tumor tissue of cancer patients carries acquired SAs that are not inherited. These SAs are generally well known to be related to the progression of the disease state and to the effectiveness of therapeutic agents.

In SA, DNA fragmented at distant positions on the genome is joined (fused). This divided position is called a break point (BP). It is impossible to reproduce the entire picture of fragmentation and fusion from fragmented sample DNA. Therefore, it is widely performed to indirectly detect BP by analyzing a large amount of sequenced sequence data.

Instead of converting the entire sample DNA fragment into a base character string, the sequencer device commonly uses the paired-end (PE) method, in which two read sequences (a pair) of substantially constant length, each covering only part of the fragment, are read from both ends of one sample DNA fragment. In general, in the PE method, the center of a sample DNA fragment contains a non-sequenced section that belongs to neither read sequence of the pair.

In general, when a pair of read sequences read by the PE method is mapped onto a reference genome, the two reads are mapped facing inward toward each other, at positions separated by a distance corresponding to the length of the sample DNA fragment. Such a pair is called a consistent pair.

On the other hand, when a BP is included in the center of the sample DNA fragment, the mapping positions of the paired read sequences on the reference genome do not satisfy this condition. Such a pair is called an inconsistent pair (DP: Discordant Pair).

By selecting and analyzing inconsistent pairs from a large amount of mapping data, the approximate position of the BP can be determined (for example, see Non-Patent Document 1). The error is about the length of the sample DNA fragment.

In addition, when a BP is included in a read sequence, the read sequence cannot be genome-mapped over its entire length. In this case, only a part of the read sequence may be genome-mapped to one location, and the remaining part may be genome-mapped to another location. Such a read is called a split read (SR). SRs are infrequent and difficult to detect. If an SR is found, the position of the BP can be determined accurately (see, for example, Non-Patent Document 2), but there is a high possibility that the detection is a false detection caused by coincidence. In addition, since the SR itself is difficult to find, the sensitivity of methods that use SRs to determine the position of the BP accurately is low.

As a known technique, reference genome coordinates are taken on one axis and the sequence coordinates of a cDNA (complementary DNA) are taken on the other axis, and the exon-intron structure is visualized by displaying the cDNA mapping result two-dimensionally (see, for example, Patent Document 1). In addition, a technique is known in which the genome coordinates of one bacterium are taken on one axis, the genome coordinates of another kind of bacterium are taken on the other axis, and regions of homology between the two bacterial genome sequences are visualized by plotting them in two dimensions (see, for example, Non-Patent Document 3).

Patent Document 1: JP 2002-099546 A

However, none of the known techniques describes the technical idea of analyzing sample DNA by mapping read sequences obtained from sequencing data onto two-dimensional genome reference coordinates, in which genome reference coordinates are arranged two-dimensionally, or the technical idea of "estimating the occurrence position".

As a representative example of the invention for solving the above problem, a method is proposed in which sample DNA is analyzed by mapping a read sequence obtained from sequencing data to a two-dimensional genome reference coordinate in which genome reference coordinates are arranged in two dimensions.

According to the present invention, for example, the occurrence position (existing region) of the breakpoint BP can be estimated with high sensitivity. Problems, configurations, and effects other than those described above will become apparent from the following description of embodiments.

A diagram showing the overall configuration of a DNA analysis system according to one embodiment.
A diagram explaining one embodiment of the data flow in the sequence data analysis apparatus.
A flowchart explaining one embodiment of the processing executed in the sequence data analysis apparatus.
A diagram showing one embodiment of a read sequence dictionary.
A diagram showing one embodiment of the data format of all mapping data.
A diagram explaining the processing image of the two-dimensional plot process 313 for DPs.
A flowchart explaining the processing content of the two-dimensional clustering process 314.
A diagram supplementarily explaining the processing content of the two-dimensional clustering process 314.
A diagram explaining the method of calculating the probability p that a distribution in which plot points line up diagonally at 45 degrees occurs by chance.
A diagram explaining the method of calculating the two-dimensional existence region M in which the overlap of the existence regions indicated by the points corresponding to DPs is maximal.
A flowchart explaining the estimation process 331 of the BP position in sequence A for the tumor sample.
A flowchart explaining the method of obtaining the extended sequence Ext(Q) of a partial sequence Q of length k.
A flowchart showing the processing executed by the partial process P(b, E, Q, S(EQ), T(EQ)) of FIG.
A diagram explaining the plots of the multivalent functions D(x, A) and D(x, B).
A diagram explaining another embodiment of the data flow in the sequence data analysis apparatus.
A flowchart explaining another embodiment of the processing executed in the sequence data analysis apparatus.
A diagram explaining another embodiment of the data flow in the sequence data analysis apparatus.
A flowchart explaining another embodiment of the processing executed in the sequence data analysis apparatus.
A diagram showing the overall configuration of a DNA analysis system according to another embodiment.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. The embodiment of the present invention is not limited to the examples described later, and various modifications are possible within the scope of the technical idea.

[Example 1]
(1) Overall Configuration
FIG. 1 shows the overall configuration of the DNA analysis system according to Example 1. The sequence data analysis apparatus 1 is realized by a computer, such as a server, having an ordinary computer configuration.

The sequence data analysis apparatus 1 includes a central processing unit (CPU) 101, a memory 102 for storing programs, a display unit 103 for displaying a GUI (Graphical User Interface) for operation and the analysis results, a hard disk drive (HDD) 104 for storing all mapping data (212 in FIG. 2), the sequence dictionaries (tumor sample read sequence dictionary 222 and normal sample read sequence dictionary 223 in FIG. 2), and the like, an input unit 105 such as a keyboard used for parameter input, and a network interface (NIF) 106, all of which are connected to a bus 107.

The sequence data analysis apparatus 1 is connected to external devices through a LAN (Local Area Network) connected to the network interface (NIF) 106 or through the Internet. The sequence dictionaries stored in the HDD 104 may instead be stored in a storage device externally connected to the sequence data analysis apparatus 1, or in a data center connected via a network. The various processes described below are realized through execution of programs by the CPU 101. In the case of FIG. 1, a genome sequence server 108 and a DNA sequencer 109 are connected to the NIF 106 via a network.

The DNA sequencer 109 sequences (reads) both ends of each sample DNA fragment (the pair consisting of the 5′-end-side read sequence and the 3′-end-side read sequence) contained in the tumor DNA sample 110 and the normal DNA sample 111, which are extracted from the tumor tissue and normal tissue of the patient, and provides the result to the sequence data analysis apparatus 1. A read sequence (base sequence) is generally written with the base character at the 5′ end on the left and the base character at the 3′ end on the right. Hereinafter, the 5′ end side is therefore referred to as "left" and the 3′ end side as "right".

The DNA sequencer 109 is configured as a massively parallel (so-called next-generation) DNA sequencer and can sequence a large number (for example, 100 million) of sample DNA fragments in parallel. Here, when the left sequence and the right sequence are each, for example, about 100 bases long and the sample DNA fragment is about 1,000 bases long, the central sequence of about 800 bases is contained in neither the left sequence nor the right sequence and is not sequenced. The genome sequence server 108, in turn, provides the sequence data analysis apparatus 1 with a genome sequence that is the result of sequencing genomic DNA.

(2) Data Flow in the Sequence Data Analysis Apparatus
FIG. 2 shows the data flow in the sequence data analysis apparatus 1. The sequence data analysis apparatus 1 accepts reference genome sequence data 201 from the genome sequence server 108 and accepts tumor sample read sequence data 202 and normal sample read sequence data 203 from the DNA sequencer 109. The reference genome sequence data 201 is given to the genome mapping processing unit 211 and the cluster evaluation unit 235. Both the tumor sample read sequence data 202 and the normal sample read sequence data 203 are given to the genome mapping processing unit 211 and the read sequence dictionary creation unit 221.

The genome mapping processing unit 211 maps all the paired-end read sequences included in the tumor sample read sequence data 202 and the normal sample read sequence data 203 onto the reference genome sequence data 201 and creates all mapping data 212. The DP extraction unit 231 selects the pairs that satisfy the condition of an inconsistent pair (DP) from all the mapping data 212 and creates DP two-dimensional plot data 232. The clustering processing unit 233 clusters the two-dimensional plot data 232 and creates two-dimensional cluster data 234. The cluster evaluation unit 235 evaluates each cluster included in the two-dimensional cluster data 234, selects the clusters presumed to contain a tumor-specific BP, refers to the reference genome sequence data 201, and generates sequence data of the region in which the BP exists (BP existence region sequence data) 236.

The read sequence dictionary creation unit 221 reads the tumor sample read sequence data 202 and the normal sample read sequence data 203, and creates a tumor sample read sequence dictionary 222 and a normal sample read sequence dictionary 223. The minimum mismatch rate calculation unit 241 searches the tumor sample read sequence dictionary 222 and the normal sample read sequence dictionary 223 using the sequences in the BP existence region sequence data 236 as queries, and creates tumor sample mismatch rate data 242 and normal sample mismatch rate data 243. The BP position estimation unit 244 analyzes the tumor sample mismatch rate data 242 and the normal sample mismatch rate data 243, estimates the position of the BP, and outputs a BP position estimation result 245. The two-dimensional plot display output unit 251 takes the cluster information used for BP estimation out of the two-dimensional cluster data 234 and displays and outputs it. The mismatch rate plot display output unit 252 displays and outputs the tumor sample mismatch rate data 242 and the normal sample mismatch rate data 243 used for estimating the position of the BP. These display outputs allow the user to visually confirm the validity of the estimation result.

(3) Details of Sequence Data Analysis Processing
Hereinafter, the processing executed by the sequence data analysis apparatus 1 to detect a BP and determine its exact position will be described in detail with reference to FIG. 3. Each process described below is realized through execution of a program by the central processing unit 101.

(Parameter input processing 301)
In this processing, parameters such as p0, m0, m1, d, L, σ, k, r, s, and e described later are input to the central processing unit 101 through the input unit 105.

(Sequence data input processing 302)
In this process, reference genome sequence data 201 is input from the genome sequence server 108 to the central processing unit 101. Tumor sample read sequence data 202 and normal sample read sequence data 203 are input from the DNA sequencer 109 to the central processing unit 101.

(Sequence dictionary creation process 303)
In this process, the read sequence dictionary creation unit 221 reads the tumor sample read sequence data 202 and the normal sample read sequence data 203, and creates the tumor sample read sequence dictionary 222 and the normal sample read sequence dictionary 223. The read sequence dictionary is a data structure equivalent to all suffixes of all read sequences sorted in alphabetical order; see, for example, "Li H. and Durbin R., Fast and accurate short read alignment with Burrows-Wheeler Transform, Bioinformatics, 25: 1754-60 (2009)" (hereinafter referred to as "Prior Art 1").

The read sequence dictionary will be described with reference to FIG. 4. In the list 401, all suffixes of all the read sequences represented by the read sequence dictionary are sorted in alphabetical order and arranged in the vertical direction. Each row corresponds to one suffix of a read sequence. With the read sequence dictionary, the suffix 403 (of a read sequence) whose sort order is i can be obtained efficiently for any i (402) (Prior Art 1). Here, 0 ≦ i < N, where N is the total number of suffixes of all read sequences and the sort order is counted from 0. The value of N is equal to the sum of the total number of bases included in the read sequence data and the number of read sequences. For an arbitrarily given base character string w (404), S(w) (405) denotes the minimum sort order of the suffixes starting with w, and T(w) (406) denotes their maximum sort order plus 1. In particular, when w is the empty character string ε, S(ε) = 0 and T(ε) = N.

Here, if S(w) and T(w) are given for a certain character string w, S(nw) and T(nw) can be obtained efficiently for any base character n = A, C, G, T (Prior Art 1). Therefore, by starting from w = ε and repeating this calculation while extending w to the left one character at a time, the sort-order range S(w), T(w) of the suffixes starting with w can be obtained efficiently for any given character string w. When there is no suffix starting with w, S(w) = T(w).
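The meaning of S(w) and T(w) can be illustrated with a minimal sketch. It assumes a naive, explicitly sorted list of suffixes and binary search; this is not the Burrows-Wheeler-based structure of Prior Art 1, which avoids storing the suffixes, and the function names `build_suffix_list` and `sort_order_range` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_suffix_list(reads):
    """Return all suffixes of all reads, sorted alphabetically.

    The empty suffix of each read is included, so the list length N equals
    (total number of bases) + (number of reads), as stated in the text.
    """
    suffixes = []
    for read in reads:
        for i in range(len(read) + 1):
            suffixes.append(read[i:])
    suffixes.sort()
    return suffixes


def sort_order_range(suffixes, w):
    """Return (S(w), T(w)) as a half-open range of sort orders.

    S(w) is the smallest sort order of a suffix starting with w and
    T(w) is the largest such sort order plus 1; S(w) == T(w) if none exists.
    """
    lo = bisect_left(suffixes, w)
    # "\uffff" is an assumed sentinel that sorts after every base character.
    hi = bisect_right(suffixes, w + "\uffff")
    return lo, hi


reads = ["ACGT", "CGTA"]
suffixes = build_suffix_list(reads)
print(len(suffixes), sort_order_range(suffixes, "CG"))
```

The naive list needs memory proportional to the total suffix length, so it only serves to make the definitions concrete on small inputs.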

(Genome mapping process 311)
In this process, the genome mapping processing unit 211 maps all the paired-end read sequences included in the tumor sample read sequence data 202 and the normal sample read sequence data 203 onto the reference genome sequence data 201 and creates all mapping data 212. The mapping is performed using a known method, for example, the method described in Prior Art 1.

(DP extraction process 312)
In this process, the DP extraction unit 231 selects the pairs that satisfy the inconsistency condition (DPs) from all the mapping data 212. The method for determining inconsistency will be described with reference to FIG. 5, which shows the data format of all mapping data. Each row corresponds to one pair of read sequences and consists of an ID identifying the pair, source information (T or N) indicating whether the pair is derived from the tumor sample or the normal sample, the mapping position information of the 5′ end side and of the 3′ end side, and the consistency determination result (Yes or No). The mapping position information of each read end consists of the chromosome name, the base position coordinate of the read end within the chromosome, and the mapping direction (+ or -).

Whether or not the mapping result of a pair is consistent is determined as follows. Let the mapping position information on the 5′ end side and the 3′ end side be (cL, x, sL) and (cR, y, sR), where cL and cR are chromosome names, x and y are base position coordinates, and sL and sR are directions (+ or -). When cL and cR differ, or when sL and sR match, the pair is immediately determined as No (inconsistent). When x < y, sL = "-", and sR = "+", or when x > y, sL = "+", and sR = "-", the pair is also determined as No (inconsistent). In all other cases, the absolute value of the difference between x and y is calculated; if it falls within the range L ± 3σ, the pair is determined as Yes (consistent), and otherwise as No (inconsistent). Here, L and σ are parameters representing the average length and the standard deviation of the sample DNA fragment, respectively. In this way, the DP extraction process 312 selects the pairs determined to be inconsistent from all the mapping data 212.
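The determination rule above can be written as a short function. This is a minimal sketch: the function name `is_consistent` is hypothetical, and treating the range L ± 3σ as inclusive at both ends is an assumption, since the text does not say how the boundaries are handled.

```python
def is_consistent(cL, x, sL, cR, y, sR, L, sigma):
    """Return True (Yes, consistent) or False (No, inconsistent) for one pair.

    (cL, x, sL) and (cR, y, sR) are the chromosome name, base position and
    direction ("+" or "-") of the 5'-side and 3'-side read ends.
    """
    # Different chromosomes or identical directions: inconsistent.
    if cL != cR or sL == sR:
        return False
    # Reads pointing away from each other: inconsistent.
    if x < y and sL == "-" and sR == "+":
        return False
    if x > y and sL == "+" and sR == "-":
        return False
    # Remaining pairs are judged by distance (inclusive bounds assumed).
    distance = abs(x - y)
    return L - 3 * sigma <= distance <= L + 3 * sigma


# A DP is any pair for which is_consistent(...) returns False.
print(is_consistent("chr1", 1000, "+", "chr1", 2000, "-", L=1000, sigma=50))
```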

(2D plot processing 313)
The two-dimensional plot process 313 for DPs will be described with reference to FIG. 6.

Reference numerals 601 and 602 denote genome coordinates for representing the mapping position coordinates of the 5′ end and the 3′ end of a DP. Here, the genome coordinates are base position coordinates on a sequence (genome sequence) in which all chromosomes are connected into one. Now, assume that the genome coordinates of the 5′ end L1 (603) and the 3′ end R1 (604) of a DP are x1 (605) and y1 (606), respectively. If x1 < y1, the point (x1, y1) on the coordinate plane is made to correspond to the DP (623), and if x1 ≧ y1, the point (y1, x1) on the coordinate plane is made to correspond to the DP. The correspondence is changed according to the magnitude relationship between x1 and y1 so that the same result is obtained when the same DNA fragment is read in the opposite direction.

Reference numerals 621 and 622 denote the x-axis and y-axis of the coordinate plane, which indicate the genome coordinates of the two end points of a DP. The position coordinates of these points 623 on the coordinate plane are temporarily stored in the two-dimensional plot data 232.
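The correspondence between a DP and a point on the coordinate plane can be sketched as follows. The sketch assumes that genome coordinates are formed by adding a per-chromosome offset to the position within the chromosome; the chromosome order, the example lengths, and the function names `chromosome_offsets` and `dp_to_point` are hypothetical.

```python
def chromosome_offsets(chrom_lengths):
    """Map each chromosome name to its start offset on the concatenated genome.

    chrom_lengths is a list of (name, length) in the assumed concatenation order.
    """
    offsets = {}
    total = 0
    for name, length in chrom_lengths:
        offsets[name] = total
        total += length
    return offsets


def dp_to_point(offsets, cL, x, cR, y):
    """Return the plot point for a DP with ends (cL, x) and (cR, y).

    The smaller genome coordinate always becomes the x value, so reading the
    same fragment in the opposite direction yields the same point.
    """
    gx = offsets[cL] + x
    gy = offsets[cR] + y
    return (gx, gy) if gx < gy else (gy, gx)


offsets = chromosome_offsets([("chr1", 1000), ("chr2", 2000)])
print(dp_to_point(offsets, "chr2", 50, "chr1", 300))
```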

Since the length of the sample DNA fragment is approximately L (within ±3σ), the points plotted on the coordinate plane line up in a band-like region extending diagonally at 45 degrees. When a genome inversion has occurred on one side of the BP (that is, when the directions of the 5′ end and the 3′ end coincide in FIG. 5), the DPs line up at 45 degrees to the right; otherwise, they line up at 45 degrees in the other diagonal direction. By displaying such a plot in the "two-dimensional plot output of cluster C" step 342 in the flowchart of FIG. 3, the user can confirm that many DPs exist as evidence indicating the existence of a BP.

(Two-dimensional clustering process 314)
FIG. 7 shows the details of the processing executed in the two-dimensional clustering process 314. First, the clustering processing unit 233 inputs the two-dimensional plot data 232 (701), projects all points onto the y-axis (702), and classifies them into clusters Y(1), Y(2), ... that are separated from each other by a distance of L or more (703). Here, L is the parameter representing the average length of the sample DNA fragment. The following iterative process is performed for each cluster Y(j) (711, 712, 713). All points projected onto cluster Y(j) are projected onto the x-axis (722) and classified into clusters X(1, j), X(2, j), ... (723). The following iterative process is then performed for each cluster X(i, j) (731, 732, 733): the set of all points projected onto X(i, j) is output as cluster C(i, j) (741). As a result, two-dimensional cluster data 234 is obtained.

FIG. 8 is an auxiliary explanatory diagram for explaining the method of the two-dimensional clustering process 314.

Reference numerals 621 and 622 denote the x-axis and y-axis of the coordinate plane representing the genome coordinates of the DP end points. The points 623 corresponding to all DPs are projected onto points 801 on the y-axis 622. All of these points 801 projected onto the y-axis 622 are classified into clusters Y(1), Y(2), ....

The points 623 projected onto cluster Y(j) (802) are projected onto points 811 on the x-axis 621. All of these points 811 projected onto the x-axis 621 are classified into clusters X(1, j), X(2, j), .... All the points 623 projected onto cluster X(i, j) (812) are output as cluster C(i, j) (623).
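The two-stage projection clustering can be sketched as below. The sketch assumes that two neighboring projected points belong to different clusters exactly when the gap between them is L or more; the function names `cluster_1d` and `cluster_2d` and the parameter name `gap` are hypothetical.

```python
def cluster_1d(points, axis, gap):
    """Split points into clusters along one axis (0 = x, 1 = y).

    Points are sorted by the chosen coordinate; a new cluster starts whenever
    the distance to the previous point is `gap` or more.
    """
    clusters = []
    for p in sorted(points, key=lambda q: q[axis]):
        if clusters and p[axis] - clusters[-1][-1][axis] < gap:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters


def cluster_2d(points, gap):
    """Cluster by y projection first, then each y cluster by x projection."""
    result = []
    for y_cluster in cluster_1d(points, axis=1, gap=gap):
        for x_cluster in cluster_1d(y_cluster, axis=0, gap=gap):
            result.append(x_cluster)  # one cluster C(i, j)
    return result


points = [(100, 5000), (150, 5050), (9000, 5020), (120, 20000)]
for c in cluster_2d(points, gap=1000):
    print(c)
```

Projecting onto one axis at a time keeps each step a one-dimensional sort, which matters when the number of DPs is large.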

(Next cluster C existence determination process 321)
In this processing, the following processing is executed for each cluster C = C (i, j), and when processing for all clusters is completed, all processing is terminated. First, the number of points included in a certain cluster C is counted, and if it is less than s, the processing proceeds to the next cluster. Here, s is a parameter for designating the minimum cluster size (322). Further, it is examined whether the DP corresponding to the point included in the cluster C is derived from the tumor sample or the normal sample. If the ratio of the tumor sample-derived DP is less than r, the process proceeds to the next cluster (323). Here, r is a parameter representing the degree of tumor specificity and is a positive number of 1 or less. Further, as described with reference to FIG. 6, the probability p that the distribution in which the plotted points are arranged at an angle of 45 degrees occurs by chance regardless of BP is calculated (324), and the value is greater than or equal to the specified parameter p0. If so, the process proceeds to the next cluster (325).

FIG. 9 is an explanatory diagram for explaining the method of calculating the probability p that a distribution in which plot points line up diagonally at 45 degrees occurs by chance.

Reference numerals 621 and 622 denote the x-axis and y-axis of the coordinate plane representing the genome coordinates of the DP end points. Only the points 623 corresponding to DPs that are included in one cluster C are displayed. A region B (901), obtained by rotating a rectangle with a vertical length of 6σ and a horizontal length of L + 6σ by 45 degrees to the right or left, is placed at the position where it contains the most points 623 in cluster C. A square region W (902) with a side length of L + 6σ, rotated by 45 degrees, is placed at the position where it contains B and contains the most points 623 in cluster C. Let m be the number of points included in B, n the total number of points included in W, and q = 6σ / (L + 6σ). If the points were distributed randomly and without bias, m would follow the binomial distribution B(n, q). When the points correspond to DPs generated by a BP, on the other hand, m takes a larger value. Therefore, the probability that m takes such a large value under the binomial distribution B(n, q) is calculated, and that probability is set as p.
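The probability p can be sketched as an upper-tail binomial probability. Interpreting "takes such a large value" as P(X ≥ m) for X following B(n, q) is an assumption, as is the function name `chance_probability`; placing the regions B and W and counting m and n are outside the scope of this sketch.

```python
from math import comb


def chance_probability(m, n, L, sigma):
    """Return P(X >= m) for X ~ Binomial(n, q) with q = 6*sigma / (L + 6*sigma).

    m: number of points inside the band-shaped region B
    n: number of points inside the enclosing square region W
    """
    q = 6 * sigma / (L + 6 * sigma)
    return sum(comb(n, k) * q**k * (1 - q) ** (n - k) for k in range(m, n + 1))


# A cluster is skipped when p is greater than or equal to the parameter p0.
p = chance_probability(m=18, n=20, L=1000, sigma=50)
print(p)
```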

(Calculation process 326 of two-dimensional existence area M)
This process will be described with reference to FIG. 10, which is an explanatory diagram for explaining the method of calculating the two-dimensional existence region M.

Reference numerals 621 and 622 denote the x-axis and y-axis of the coordinate plane representing the genome coordinates of the DP end points. Only the points 623 corresponding to DPs that are included in one cluster C are displayed. These points 623 are assumed to correspond to DPs satisfying the conditions sL = "+" and sR = "-" in FIG. 5. The cases where other conditions are satisfied will be described later.

Assume that the x and y coordinates of a point 623 corresponding to a DP are x1 and y1. Under the above conditions, one BP lies behind x1 and the other lies in front of y1. That is, if the coordinates of those BPs are x0 and y0, the following holds.
x0 > x1 and y0 < y1 (Formula 1)
At this time, the length of the sample DNA fragment is x0 − x1 + y1 − y0, so this value generally falls within the range L ± 3σ.

That is, the following holds.
L − 3σ < x0 − x1 + y1 − y0 < L + 3σ (Formula 2)
Here, L and σ are parameters representing the average length and standard deviation of the sample DNA fragment, respectively.

Thus, if one point (x1, y1) corresponding to a DP is given, the existence region of the BP coordinates (x0, y0) is given by the linear inequalities in x0 and y0 shown in Formulas 1 and 2. Strictly speaking, the existence region defined in this way is a trapezoid, but since the purpose is to indicate an approximate position, the exact shape has little meaning. In FIG. 10, the representation is therefore simplified and the region is approximated by an elliptical region 1002. The existence region 1002 lies in the direction (1001) at 45 degrees diagonally below and to the right of the point 623. Since cluster C contains a plurality of points corresponding to DPs, the region in which the overlap of the existence regions indicated by these points is maximal is defined as the two-dimensional existence region M (1003).
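One way to approximate the region M is to evaluate Formulas 1 and 2 on a coarse grid and keep the grid cells covered by the largest number of DP points. This is a brute-force sketch under stated assumptions: it handles only the case sL = "+" and sR = "-", the grid spacing `step` and the function names `covers` and `max_overlap_region` are hypothetical, and the source does not specify how the overlap is actually computed.

```python
def covers(point, x0, y0, L, sigma):
    """True if (x0, y0) satisfies Formula 1 and Formula 2 for the DP point."""
    x1, y1 = point
    length = (x0 - x1) + (y1 - y0)  # implied sample DNA fragment length
    return x0 > x1 and y0 < y1 and L - 3 * sigma < length < L + 3 * sigma


def max_overlap_region(points, L, sigma, step):
    """Return (overlap count, x range, y range) of the most-covered grid cells.

    Returns None when no grid cell is covered by any point.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    reach = L + 3 * sigma  # a BP cannot be farther than this from a DP end
    best, cells = 0, []
    x0 = min(xs)
    while x0 <= max(xs) + reach:
        y0 = min(ys) - reach
        while y0 <= max(ys):
            count = sum(covers(p, x0, y0, L, sigma) for p in points)
            if count > best:
                best, cells = count, [(x0, y0)]
            elif count == best and count > 0:
                cells.append((x0, y0))
            y0 += step
        x0 += step
    if not cells:
        return None
    cx = [c[0] for c in cells]
    cy = [c[1] for c in cells]
    return best, (min(cx), max(cx)), (min(cy), max(cy))


# Three DPs generated around a BP at (10000, 50000) with L = 1000.
dps = [(9800, 50800), (9500, 50500), (9200, 50200)]
print(max_overlap_region(dps, L=1000, sigma=50, step=50))
```

The returned x and y ranges correspond to the projections of M onto the two axes; a finer `step` gives tighter ranges at a higher cost.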

The same applies when the point 623 corresponding to a DP satisfies conditions other than sL = "+" and sR = "-" in FIG. 5. However, when sL = "-" and sR = "+", the BP existence region 1002 appears in the direction of 45 degrees diagonally to the upper left of the point 623 corresponding to the DP; when sL = "+" and sR = "+", it appears in the direction of 45 degrees diagonally to the upper right of the point 623; and when sL = "-" and sR = "-", it appears in the direction of 45 degrees diagonally to the lower left of the point 623. In every case, the overlap of these regions is obtained in the same manner to give the two-dimensional existence region M.

The two-dimensional existence region M is projected in the x-axis direction and the y-axis direction to determine the respective coordinate ranges. With reference to the reference genome sequence data 201, the partial sequences A (1004) and B (1005) of the reference genome in those coordinate ranges are obtained. However, when sL = "-", the complementary strand is taken for A, and when sR = "+", the complementary strand is taken for B.
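Extracting A and B from the projected coordinate ranges can be sketched as follows. The sketch assumes 0-based inclusive coordinate ranges on a genome held as one string, and it interprets "taking the complementary strand" as taking the reverse complement; both are assumptions, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a base string over A, C, G, T, N."""
    return seq.translate(_COMPLEMENT)[::-1]


def projected_sequences(genome, x_range, y_range, sL, sR):
    """Return (A, B) for the x and y projections of the region M.

    x_range and y_range are (start, end) pairs, 0-based and inclusive.
    """
    a = genome[x_range[0] : x_range[1] + 1]
    b = genome[y_range[0] : y_range[1] + 1]
    if sL == "-":
        a = reverse_complement(a)
    if sR == "+":
        b = reverse_complement(b)
    return a, b


print(projected_sequences("AACCGGTTAC", (0, 3), (6, 9), "-", "+"))
```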

(Acquisition process 327 of sequence A and sequence B at the projection destinations of M)
Through the above processing, sequence A and sequence B, the projection destinations of M, are obtained.

(Estimation process 331 of BP position in sequence A for tumor sample)
In this process, the BP position within sequence A is estimated for the tumor sample.

(Estimation success judgment process 332)
In this process, it is determined whether or not the estimation of the BP position in sequence A for the tumor sample has succeeded. If the estimation has failed (No), the processing of the current cluster C is stopped and the process proceeds to process 321 for the next cluster. If not (Yes), the processing is continued.

(Estimation process 333 of BP position in sequence B for tumor sample)
In this process, the BP position within the sequence B is estimated for the tumor sample.

(Estimation success judgment process 334)
In this process, it is determined whether or not the estimation of the BP position in the sequence B obtained for the tumor sample is successful. If the estimation fails (No), the processing of the current cluster C is stopped and the processing proceeds to the processing 321 of the next cluster. If not (Yes), the processing is continued.

(Estimation process 335 of BP position in sequence A for normal sample)
In this process, the BP position in sequence A is estimated for the normal sample.

(Estimation success judgment process 336)
In this process, it is determined whether or not the estimation of the BP position in sequence A for the normal sample has succeeded. If the estimation has succeeded (Yes), the processing of the current cluster C is stopped and the process proceeds to process 321 for the next cluster. If not (No), the processing is continued.

(Estimation process 337 of BP position in sequence B for normal sample)
In this process, the BP position in sequence B is estimated for the normal sample.

(Estimation success judgment process 338)
In this process, it is determined whether or not the estimation of the BP position in sequence B for the normal sample has succeeded. If the estimation has succeeded (Yes), the processing of the current cluster C is stopped and the process proceeds to process 321 for the next cluster. If not (No), the processing is continued.

(Estimation result output processing 341)
In this process, the estimation results of the BP positions in sequences A and B obtained as described above are output.

(Cluster C two-dimensional plot output process 342)
In this process, a two-dimensional plot of cluster C is output in order to show the user the situation that is the basis for the above estimation.

(D (x, A), D (x, B) plot output processing 343)
In this process, a graph plot of a multivalent function giving the minimum mismatch rate is output. Note that the plot output may be temporarily stored in the HDD (104) and later output to the display unit 103 in response to a user request. In this way, the process for the current cluster C is completed, and the process proceeds to the process 321 for the next cluster. A similar method is used for the estimation processing of the processing 331, the processing 333, the processing 335, and the processing 337.

(Details of the estimation process 331)
Hereinafter, details of the estimation process 331 will be described. If the array A and the array B in the estimation process 331 are exchanged, the contents of the estimation process 333 are obtained. If the tumor samples used in these estimation processes are replaced with normal samples, the contents of the estimation processes 335 and 337 are obtained.

FIG. 11 is a flowchart for explaining the estimation process 331 of the BP position in the array A with respect to the tumor sample.

First, the sequence A is scanned (1101, 1104, 1105), and a partial sequence Q = A(x, k) of length k at the position of the coordinate x is extracted (1102). Next, the extended sequences Ext(Q) of the partial sequence Q are compared with the sequences A and B, and the minimum mismatch rate (minimum discrepancy ratio) is calculated (1103).

In general, the mismatch rate d(E, F) between a sequence E and a sequence F is defined as the base mismatch rate (the ratio of bases subject to substitution, insertion, or deletion) when E and F are optimally aligned. Further, the minimum mismatch rate D(E, W) between a sequence E and a sequence W is defined as the minimum value of d(E, F) over all partial sequences F of W. These values can be calculated efficiently by using the dynamic programming described in Japanese Patent No. 5528249. Since Q can have a plurality of extended sequences Ext(Q), as described later, the minimum mismatch rate between Ext(Q) and each of the sequences A and B is a multivalued function of x. Let these functions be D(x, A) and D(x, B). That is, the following values are calculated (1103).
D(x, A) = D(Ext(Q), A) (Formula 3)
D(x, B) = D(Ext(Q), B) (Formula 4)
These calculation methods will be described later.
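As an illustration of the definitions of d(E, F) and D(E, W), the following Python sketch computes the minimum mismatch rate by a standard semi-global alignment with dynamic programming. It is not the method of Japanese Patent No. 5528249, whose details are not reproduced here. It assumes that the mismatch rate is the number of substituted, inserted, or deleted bases divided by the length of E; the function name min_mismatch_rate is hypothetical.

```python
def min_mismatch_rate(e: str, w: str) -> float:
    """Sketch of D(E, W): the minimum of d(E, F) over all partial sequences F of W.

    Assumption (not stated in the source): d(E, F) is the edit distance
    between E and F divided by len(E).
    """
    if not e:
        return 0.0
    # prev[j] is the smallest edit distance between e[:i] and any
    # partial sequence of w that ends at position j.
    # Row i = 0 is all zeros: the alignment may start anywhere in w at no cost.
    prev = [0] * (len(w) + 1)
    for i in range(1, len(e) + 1):
        cur = [i] + [0] * len(w)  # e[:i] against an empty partial sequence
        for j in range(1, len(w) + 1):
            substitution = prev[j - 1] + (e[i - 1] != w[j - 1])
            deletion = prev[j] + 1      # base of e has no partner in w
            insertion = cur[j - 1] + 1  # base of w has no partner in e
            cur[j] = min(substitution, deletion, insertion)
        prev = cur
    # The alignment may also end anywhere in w, so take the row minimum.
    return min(prev) / len(e)
```

Under these assumptions, a sequence E that occurs exactly in W gives a rate of 0, and the rate increases as more bases of E have to be substituted, inserted, or deleted to fit the best-matching part of W.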

FIG. 14 is an explanatory diagram of plots of the multivalued functions D(x, A) and D(x, B). The horizontal axis 1401 is the x axis, and the vertical axis 1402 is the value of the mismatch rate D. Since these functions are multivalued, their graphs branch and take a plurality of D values for a single x. Displaying such a graph in the "D(x, A), D(x, B) plot output" step 343 of the flowchart allows the user to check the estimation visually.

Let y be the coordinate that gives the maximum value of the multivalued function D(y, A), and let z be the coordinate that gives the maximum value of the multivalued function D(z, B) (1111). It is then examined whether the following condition is satisfied: the maximum value is larger than m0, the minimum value is smaller than m1, and the absolute value of the difference between y and z is d or less (1112). Here, m0, m1, and d are given parameters. If condition 1112 is not satisfied, it is estimated that there is no BP in the sequence A (1113), and the process ends with the estimation having failed (1114). If condition 1112 is satisfied and y = z (1115), the position of the BP in the sequence A is estimated to be y (1116); if y ≠ z (1115), the position of the BP in the sequence A is estimated to lie in the range between y and z (1117). In either case, the estimation is successful (1118) and the process ends. The case y ≠ z arises when some other short sequence is inserted at the position of the BP.
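The determination of steps 1111 to 1118 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: each multivalued function is represented as a dictionary from the coordinate x to the list of D values at x, the maximum and minimum conditions are applied to both D(x, A) and D(x, B), and ties for the maximum are resolved toward the smallest coordinate. The names estimate_bp and Multi are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Assumed representation: x -> all D values at x (empty when Ext(Q) is empty).
Multi = Dict[int, List[float]]


def _summary(f: Multi) -> Optional[Tuple[int, float, float]]:
    """Return (coordinate of the maximum, maximum value, minimum value)."""
    points = {x: vals for x, vals in f.items() if vals}
    if not points:
        return None
    # sorted() makes ties resolve to the smallest coordinate (an assumption).
    peak_x = max(sorted(points), key=lambda x: max(points[x]))
    highest = max(points[peak_x])
    lowest = min(min(vals) for vals in points.values())
    return peak_x, highest, lowest


def estimate_bp(d_a: Multi, d_b: Multi, m0: float, m1: float,
                d: int) -> Optional[Tuple[int, int]]:
    """Return the estimated BP range (lo, hi) in sequence A, or None on failure."""
    sa, sb = _summary(d_a), _summary(d_b)
    if sa is None or sb is None:
        return None
    (y, max_a, min_a), (z, max_b, min_b) = sa, sb          # step 1111
    satisfied = (max_a > m0 and max_b > m0                  # condition 1112
                 and min_a < m1 and min_b < m1
                 and abs(y - z) <= d)
    if not satisfied:
        return None                                         # steps 1113, 1114
    # lo == hi corresponds to step 1116, lo < hi to step 1117.
    return (min(y, z), max(y, z))
```

A returned pair with equal elements corresponds to a BP estimated at a single position; a pair with different elements corresponds to a BP estimated to lie between y and z.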

FIG. 12 is a flowchart showing a method for obtaining the extended sequences Ext(Q) of Q. FIG. 13 is a flowchart showing the partial process P(b, E, Q, S(EQ), T(EQ)). The method for obtaining Ext(Q) will be described with reference to these drawings.

In FIG. 12, a query sequence Q is input (1201), k is set to the length Len(Q) of Q, and w is set to the empty string ε (1202). At 1203, S(w) and T(w) are calculated using the tumor sample read sequence dictionary 222. If N is the total number of suffixes of all read sequences of the tumor sample, then for w = ε, S(ε) = 0 and T(ε) = N, as described above. If S(w) < T(w) does not hold, that is, if w does not occur in the read sequences, the process is terminated immediately; otherwise, the process proceeds to the next step (1204). If k > 0 (1205), the k-th character Q(k) of Q is set to n, n is prepended to the character string w so that w is extended by one character to the left, k is decreased by 1 (1206), and the process returns to 1203. As described above with reference to FIG. 4, the calculation in 1203 can be performed efficiently for w extended by one character to the left.

If k = 0 (1205), E is set to the empty string ε (1211), and the partial process P(b, E, Q, S(EQ), T(EQ)) (1213) is repeated for each base b of A, C, G, and T (1212); when the repetition is completed, the process ends. At this point, w = Q and E = ε, so the values of the arguments b, E, Q, S(EQ), and T(EQ) passed to the partial process P (1213) have already been calculated. The partial process P (1213) may yield a plurality of extended sequences of Q, or none at all. Therefore, the number of extended sequences of Q output in the overall processing of FIG. 11 is an integer greater than or equal to 0, and Ext(Q) is multivalued.

Next, the partial process P(b, E, Q, S(EQ), T(EQ)) (1213) will be described with reference to FIG. 13. The values of the arguments b, E, Q, S(EQ), and T(EQ) shown in parentheses are passed to this partial process (1213).

In 1301, b is added to the left of E, extending E by one character to the left. S(EQ) and T(EQ) are then calculated for the sequence EQ extended by one character to the left (1302). This calculation can be performed efficiently, as described above with reference to FIG.

If S(EQ) < T(EQ) does not hold, that is, if EQ does not occur in the read sequences, the process is terminated immediately; otherwise, the process proceeds to the next step (1303). If the length Len(E) of E is smaller than the given parameter e (1304), the partial process P(b, E, Q, S(EQ), T(EQ)) (1307) is executed repeatedly (1306), and when the repetition is completed, the process ends. Since E is extended by one character each time the partial process is called recursively, the determination condition of 1304 limits the recursion depth to the parameter e or less, so the process does not fall into infinite recursion. The determination of 1304 is No when Len(E) = e. In that case, E is output as one extended sequence of Q (1305), and the process ends.
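The procedure of FIG. 12 and FIG. 13 can be sketched in Python as follows. This is a simplified illustration, not the implementation of the read sequence dictionary 222: it assumes that the dictionary can be modeled as a sorted list of all suffixes of all reads, that S(w) and T(w) are the bounds of the half-open range of suffixes beginning with w, and that the recursion of 1306 iterates over the four bases. It also looks up Q in one step instead of extending w one character at a time (1203 to 1206). The names build_dictionary, suffix_range, and extended_sequences are hypothetical.

```python
from bisect import bisect_left
from typing import List, Tuple


def build_dictionary(reads: List[str]) -> List[str]:
    """Sorted list of every suffix of every read (naive stand-in for the dictionary)."""
    return sorted(r[i:] for r in reads for i in range(len(r)))


def suffix_range(suffixes: List[str], w: str) -> Tuple[int, int]:
    """Return (S(w), T(w)); w occurs in the reads exactly when S(w) < T(w)."""
    # "~" sorts after every base letter, so w + "~" is an upper bound
    # for all strings that begin with w.
    return bisect_left(suffixes, w), bisect_left(suffixes, w + "~")


def extended_sequences(suffixes: List[str], q: str, e: int) -> List[str]:
    """Sketch of Ext(Q): left extensions E of length e such that EQ occurs in the reads."""
    results: List[str] = []

    def partial(b: str, ext: str) -> None:
        ext = b + ext                            # 1301: extend E to the left by b
        s, t = suffix_range(suffixes, ext + q)   # 1302: S(EQ), T(EQ)
        if s >= t:                               # 1303: EQ does not occur
            return
        if len(ext) < e:                         # 1304: Len(E) < e
            for next_base in "ACGT":             # 1306, 1307 (assumed: all four bases)
                partial(next_base, ext)
        else:
            results.append(ext)                  # 1305: output E

    s, t = suffix_range(suffixes, q)
    if s < t and e > 0:
        for b in "ACGT":                         # 1212, 1213
            partial(b, "")
    return results
```

For example, with the reads "GGACGT" and "TTACGT", the query "ACGT" and e = 2 yield the two extensions "GG" and "TT", which illustrates why Ext(Q) is multivalued. A suffix array or a similar index would replace the naive sorted list in practice.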

As a modification of the embodiment described above, mutation analysis can be performed when, in FIG. 1, the normal DNA sample 111 is not given and only the tumor DNA sample 110 is given. In this case, however, both mutations genetically inherited from the parents and somatic mutations acquired through canceration are analyzed together. Since there is no data related to the normal DNA sample 111, the processing is simplified.

FIG. 15 shows a data flow related to the modification of the embodiment of the present invention. Reference genome sequence data 1501 is the same as reference genome sequence data 201 in FIG. 2, and sample read sequence data 1502 is the same as tumor sample sequence data 202 in FIG. The subsequent flow of data is the same as the corresponding parts denoted by the same reference numerals in FIG.

FIG. 16 is a flowchart showing the flow of processing relating to the modification of the embodiment of the present invention. The processing of each step is the same as the corresponding part denoted by the same reference numeral in FIG.

(4) Summary
As described above, when the sequence data analysis apparatus 1 (DNA analysis system) according to the present embodiment is used, the sample DNA is analyzed and the occurrence position (existing region) of an unknown event such as a breakpoint BP can be estimated with high sensitivity. Further, by using the read sequence dictionaries (tumor sample read sequence dictionary 222 and normal sample read sequence dictionary 223) to collect and analyze the read sequences that pass through the occurrence position (existing region), the occurrence position (existing region) of an unknown event such as a breakpoint BP can be estimated efficiently.

[Example 2]
(1) Outline
Horizontal gene transfer, in which a part of a gene moves from one bacterium to another, may occur between different types of bacteria. Even a non-pathogenic bacterium may acquire pathogenicity by receiving harmful genetic factors from other bacteria through horizontal transfer. Pathogenic bacteria may likewise acquire new drug resistance by receiving drug-resistance genetic factors from other bacteria. Therefore, in this example, a method for analyzing horizontal gene transfer between bacteria will be described.

(2) Overall Configuration
The basic configuration of the DNA analysis system according to this example is the same as that of Example 1 (FIG. 1). However, the DNA sequencer 109 in this embodiment analyzes a specimen infected with bacteria instead of the tumor DNA sample 110 and the normal DNA sample 111, and provides the sample read sequence data 1502 (FIG. 15) as the analysis result to the sequence data analysis apparatus 1.

(3) Data Flow in Sequence Data Analysis Device
With reference to FIG. 15, the data flow when the sequence data analysis device 1 is used for the application of this embodiment will be described. In FIG. 15, parts corresponding to those in FIG. The sequence data analysis apparatus 1 accepts reference genome sequence data 1501 from the genome sequence server 108 and accepts sample read sequence data 1502 from the DNA sequencer 109. The reference genome sequence data 1501 is a collection of reference genome sequences of a plurality of bacteria to be analyzed. The sample read sequence data 1502 is obtained by analyzing the specimen with the DNA sequencer 109. The subsequent flow of data is the same as the corresponding parts denoted by the same reference numerals in FIG.

(4) Contents of Sequence Data Analysis Processing
Hereinafter, the sequence data analysis processing according to the present embodiment will be described in detail with reference to FIG. In FIG. 16, the same reference numerals are given to the portions corresponding to those in FIG. 3. Therefore, the processing operation of the present embodiment is the same as the corresponding part denoted by the same reference numeral in FIG.

Suppose that the breakpoints BP obtained as a result of the analysis processing according to the present embodiment are BP1, BP2, ..., BPt, and let the sequences joined at BPi be Ai and Bi. Each of these is a partial sequence of the reference genome sequence of one of the bacteria. Accordingly, let Ai be a partial sequence of the reference genome of bacterium Vi, and let Bi be a partial sequence of the reference genome of bacterium Wi. If Vi and Wi are not equal, it is determined that horizontal transfer of a genetic factor has occurred between bacteria Vi and Wi. In that case, the horizontally transferred genetic factor is located adjacent to the breakpoint BPi. On the other hand, if Vi = Wi for every i = 1, 2, ..., t, it is determined that no horizontal gene transfer between bacteria has occurred.
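The determination rule above can be sketched in Python as follows. The sketch assumes that each breakpoint BPi has already been annotated with the bacteria Vi and Wi whose reference genomes contain Ai and Bi; the function name horizontal_transfers and the tuple layout are hypothetical.

```python
from typing import List, Tuple


def horizontal_transfers(origins: List[Tuple[str, str]]) -> List[Tuple[int, str, str]]:
    """Return (i, Vi, Wi) for every breakpoint BPi whose two sides come from
    different bacteria; an empty list means no horizontal transfer was found.

    origins[i - 1] is assumed to hold (Vi, Wi) for breakpoint BPi.
    """
    return [(i, v, w)
            for i, (v, w) in enumerate(origins, start=1)
            if v != w]
```

An empty result corresponds to the case Vi = Wi for every i, and each returned entry identifies a breakpoint adjacent to a horizontally transferred genetic factor.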

(5) Summary
As described above, by using the sequence data analysis apparatus 1 (DNA analysis system) according to the present embodiment, it is possible to analyze the presence or absence of horizontal gene transfer between bacteria.

[Example 3]
(1) Outline
In fields such as forensic medicine, there is a need to analyze a DNA sample and perform personal identification (HID). For personal identification, the sample DNA sequence is compared with a reference genome sequence to determine the polymorphisms (genetically inherited types) unique to that individual. Such polymorphisms include single nucleotide polymorphism (SNP), copy number variation (CNV), and structural polymorphism (SV: structural variation). If a large number of matching polymorphisms are detected in two samples, it is determined that the two DNA samples are very likely to be from the same person.

Therefore, in this embodiment, a method of performing personal identification (HID) will be described, in which polymorphisms contained in two DNA samples are detected and the detection results are compared to determine whether or not the samples are derived from the same person.

(2) Overall Configuration
The basic configuration of the DNA analysis system according to this example is the same as that of Example 1 (FIG. 1). However, the DNA sequencer 109 of this embodiment analyzes two samples H1 and H2 instead of the tumor DNA sample 110 and the normal DNA sample 111, and provides the read sequence data 1502 as the analysis result to the sequence data analysis apparatus 1. Here, the DNA sequencer 109 analyzes each of the samples H1 and H2 independently and generates read sequence data 1502 for each sample.

(3) Data Flow in Sequence Data Analysis Device
A data flow when using the sequence data analysis device 1 for the application of this embodiment (detection of polymorphism) will be described with reference to FIG. The reference genome sequence data 1501 is the same data as the reference genome sequence data 201 of FIG. 2, and the sample read sequence data 1502 is the read sequence data of the sample H1 or H2. The subsequent flow of data is the same as the corresponding parts denoted by the same reference numerals in FIG.

(4) Contents of Sequence Data Analysis Processing
Hereinafter, the sequence data analysis processing according to the present embodiment will be described in detail with reference to FIG. In this embodiment as well, the processing is the same as the corresponding portions given the same reference numerals in FIG. 3, except that the samples H1 and H2 are processed. As a result of the analysis process, estimation results of the breakpoints BP in the sample H1 and the breakpoints BP in the sample H2 are obtained.

In general, if a sample includes a CNV or an SV, a breakpoint BP appears at its boundary position. Therefore, if many of the breakpoints BP obtained from the samples H1 and H2 match each other, it is extremely likely that the two samples are derived from the same person. Let n1 be the number of breakpoints BP obtained from the sample H1, n2 the number of breakpoints BP obtained from the sample H2, and c the number of breakpoints BP obtained in common from the samples H1 and H2. If both of the following two formulas hold, it is determined that sample H1 and sample H2 are from the same person.
n1 ≧ n0 and n2 ≧ n0 (Formula 5)
c / n1 ≧ r and c / n2 ≧ r (Formula 6)

Here, n0 and r are parameters determined in advance as judgment criteria. When Formula 5 is satisfied but Formula 6 is not, it is determined that sample H1 and sample H2 are derived from different persons. When Formula 5 is not satisfied, it is determined that no determination can be made.
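The three-way determination based on Formula 5 and Formula 6 can be sketched in Python as follows. The sketch assumes that a breakpoint is represented as a (chromosome, position) pair and that two breakpoints match only when they are identical; a practical comparison might allow a positional tolerance. The names identify and Breakpoint are hypothetical.

```python
from typing import Set, Tuple

# Assumed representation of a breakpoint: (chromosome, position).
Breakpoint = Tuple[str, int]


def identify(bp_h1: Set[Breakpoint], bp_h2: Set[Breakpoint],
             n0: int, r: float) -> str:
    """Apply Formula 5 and Formula 6 to the breakpoints of samples H1 and H2."""
    n1, n2 = len(bp_h1), len(bp_h2)
    # Formula 5: n1 >= n0 and n2 >= n0 (also guards against division by zero).
    if n1 < max(n0, 1) or n2 < max(n0, 1):
        return "undetermined"
    c = len(bp_h1 & bp_h2)  # breakpoints obtained in common (exact match assumed)
    # Formula 6: c / n1 >= r and c / n2 >= r.
    if c / n1 >= r and c / n2 >= r:
        return "same person"
    return "different persons"
```

For instance, with n1 = n2 = 4 and c = 3, the ratio c / n1 = c / n2 = 0.75, so the samples are judged to be from the same person for r = 0.7 and from different persons for r = 0.8, while n0 = 5 makes the determination impossible.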

(5) Summary
As described above, personal identification can be performed by using the sequence data analysis apparatus 1 (DNA analysis system) according to this embodiment.

[Example 4]
(1) Overview
Many examples of structural mutation (SA) related to cancer are known, and a breakpoint BP always occurs at the boundary of an SA. Therefore, in this embodiment, a method will be described for determining whether or not a known SA has occurred in a tumor sample when DNA sequencing is performed by the single-end (SE) method instead of the PE method.

(2) Overall Configuration
The basic configuration of the DNA analysis system according to this example is the same as that of Example 1 (FIG. 1). However, the DNA sequencer 109 in this embodiment performs DNA sequencing by the SE method. Further, the normal sample 111 is not necessary for sequencing.

(3) Data Flow in Sequence Data Analysis Device
With reference to FIG. 17, the data flow when the sequence data analysis device 1 is used for the application of this embodiment will be described. In FIG. 17, parts corresponding to those in FIG. Candidate sequence data A and B (1701) are sequences around the boundary of the known SA, and are taken into the sequence data analysis apparatus 1 from the input unit 105 through the sequence data input unit 1702. The sample read sequence data 1703 is read sequence data obtained by sequencing the tumor sample 110 by the SE method. The subsequent flow of data is the same as the corresponding parts denoted by the same reference numerals in FIG.

(4) Contents of Sequence Data Analysis Processing
Hereinafter, the sequence data analysis processing according to the present embodiment will be described in detail with reference to FIG. In FIG. 18, parts corresponding to those in FIG. 3 are denoted by the same reference numerals. In this embodiment, step 1801 is executed instead of steps 311 to 314 and steps 321 to 327. Step 1801 is processing for reading the candidate sequence 1701 by the sequence data input unit 1702.

In this embodiment, a positive determination process (step 1811) is executed instead of steps 335 to 338. In this embodiment, when an estimation failure (negative result) is obtained in steps 332 and 334, a negative determination process (step 1812) is executed. In these determination processes, the determination result is transmitted to the display unit 103. In this embodiment, since it is not necessary to arrange the two-dimensional genome reference coordinates of the same type as in the case of PE, the two-dimensional plot output (step 342) is not executed. The other processes are the same as the corresponding parts denoted by the same reference numerals in FIG.

(5) Summary
As described above, when the sequence data analysis apparatus 1 (DNA analysis system) according to the present embodiment is used, if it is determined that a BP exists (positive) at the boundary position of a known SA, it is determined that the SA has occurred; otherwise, it is determined that the SA has not occurred.

[Example 5]
(1) Outline
Many examples are known in which a fusion gene (GF) is expressed in relation to cancer, and a BP always occurs at the fusion position of a GF. Therefore, in this embodiment, a method will be described for determining whether or not a known GF is expressed in a tumor sample when cDNA sequencing of the tumor sample is performed by the single-end (SE) method instead of the PE method.

(2) Overall Configuration
The basic configuration of the cDNA analysis system according to this example is the same as that of Example 1 (FIG. 1). However, in this embodiment, a tumor cDNA sample is used instead of the tumor DNA sample 110, and the DNA sequencer 109 performs sequencing of the cDNA by the SE method. Further, the normal sample 111 is not necessary.

(3) Data Flow in Sequence Data Analysis Device
With reference to FIG. 17, the data flow when the sequence data analysis device 1 is used for the application of this embodiment will be described. In this embodiment, candidate sequence data A and B (1701) are the known sequences of the two genes to be fused, and are taken into the sequence data analysis apparatus 1 from the input unit 105 through the sequence data input unit 1702. The sample read sequence data 1703 is read sequence data obtained by sequencing a tumor cDNA sample by the SE method. The subsequent flow of data is the same as the corresponding parts denoted by the same reference numerals in FIG.

(4) Contents of Sequence Data Analysis Processing
Hereinafter, the sequence data analysis processing according to the present embodiment will be described in detail with reference to FIG. In this embodiment, in step 1801, the candidate sequence 1701 is read by the sequence data input unit 1702. In step 1811 and step 1812, positive and negative determination results are transmitted to the display unit 103, respectively. The other processes are the same as the corresponding parts denoted by the same reference numerals in FIG. In this example, when it is determined that there is a breakpoint BP in each of the known sequences A and B of the two genes (positive), it is determined that the fusion gene of the two genes is expressed; otherwise, it is determined that such a fusion gene is not expressed.

(5) Summary
As described above, if the sequence data analysis apparatus 1 (DNA analysis system) according to the present embodiment is used, it can be determined whether or not a known GF is expressed in a tumor sample.

[Other embodiments]
The present invention is not limited to the above-described embodiments, and includes various modifications. For example, as shown in FIG. 19, the data analysis apparatus 1 may be realized as a part of the function of the DNA sequencer 109. That is, the DNA sequencer 109 may be configured by the data analysis device 1 and the sequencing unit 1091. Here, the sequencing unit 1091 performs the above-described sequencing operation.

The above-described embodiment has been described in detail for easy understanding of the present invention, and it is not always necessary to have all the configurations described. In addition, a part of one embodiment can be replaced with the configuration of another embodiment. Moreover, the structure of another Example can also be added to the structure of a certain Example. In addition, with respect to a part of the configuration of each embodiment, a part of the configuration of another embodiment can be added, deleted, or replaced.

In addition, each of the above-described configurations, functions, processing units, processing means, and the like may be realized by hardware by designing a part or all of them with, for example, an integrated circuit. Each of the above-described configurations, functions, and the like may be realized by the processor interpreting and executing a program that realizes each function (that is, in software). Information such as programs, tables, and files that realize each function can be stored in a storage device such as a memory, a hard disk, or an SSD (Solid State Drive), or a storage medium such as an IC card, an SD card, or a DVD. Control lines and information lines indicate what is considered necessary for the description, and do not represent all control lines and information lines necessary for the product. In practice, it can be considered that almost all components are connected to each other.

1: Sequence data analysis device
108: Genome sequence server
109: DNA sequencer
211: Genome mapping processing unit
221: Read sequence dictionary creation unit
231: DP extraction unit
233: Clustering processing unit
235: Cluster evaluation unit
241: Minimum mismatch rate calculation unit
244: BP position estimation unit
251: Two-dimensional plot display output unit
252: Inconsistency rate plot display output unit
1702: Candidate sequence input unit

Claims

A data acquisition unit for acquiring sequencing data of sample DNA fragments;
A data analysis apparatus comprising: an analysis unit that analyzes the sample DNA by mapping a read sequence obtained from the sequencing data to a two-dimensional genome reference coordinate in which genome reference coordinates are arranged in two dimensions.
The data analysis apparatus according to claim 1,
The data acquisition unit acquires paired-end sequencing data as the sequencing data,
The analysis unit arranges the genome reference coordinates of the same type in two dimensions to form the two-dimensional genome reference coordinates.
The data analysis apparatus according to claim 1,
The analysis unit performs the analysis by clustering the mapped read sequences.
The data analysis apparatus according to claim 3,
The analysis unit
Estimating a region in the two-dimensional genome reference coordinates where a breakpoint of the sample DNA exists based on a distribution of points in the clustering;
A data analysis apparatus characterized by that.
The data analysis device according to claim 4,
Determining the breakpoint by querying the read sequence dictionary for sequences contained in the region;
A data analysis apparatus characterized by that.
The data analysis device according to claim 5,
The analysis unit
Scanning the region by changing the position of the read fragment sequence matched by the query, and determining the breakpoint by performing the sequence comparison based on the base match / mismatch of each point in the region,
A data analysis apparatus characterized by that.
The data analysis apparatus according to claim 3,
The analysis unit
Extracting mismatched pairs by performing paired-end analysis and comparing the sequencing data to a reference genome sequence,
Executing the mapping by plotting the end point coordinates of the two read sequences constituting each mismatched pair on both axes of the two-dimensional genome reference coordinates;
A data analysis apparatus, wherein a genome coordinate position of a breakpoint of the sample DNA is determined based on the mapping result.
The data analysis device according to claim 7,
The analysis unit determines that the plurality of mismatched pairs are caused by a common breakpoint when the end points of the plurality of mismatched pairs in the cluster are plotted side by side at a predetermined angle in the two-dimensional genome reference coordinates.
The data analysis apparatus according to claim 8, wherein
The data analysis apparatus characterized in that the predetermined angle is 45 degrees.
The data analysis device according to claim 9, wherein
The analysis unit
For each of the coordinate end points corresponding to the plurality of mismatched pairs arranged in the 45 degree direction, a candidate point in the oblique 45 degree direction separated by a predetermined distance corresponding to the fragment length of the sample DNA is calculated, A data analysis apparatus characterized by determining that the breakpoint exists in an area where a candidate point exists.
The data analysis apparatus according to claim 8, wherein
The analysis unit
A data analysis apparatus, wherein the accuracy of the determination is calculated based on a probability that randomly given points are plotted alongside the predetermined angle.
The data analysis device according to claim 7,
The analysis unit
The data analysis apparatus characterized by performing the clustering by separating the end point coordinates separated by an insert length or more.
The data analysis apparatus according to claim 3,
The analysis unit determines whether each read sequence constituting the cluster is derived from a tumor tissue sample or a normal tissue sample, and extracts a cluster in which the breakpoint is tumor-specific based on the determination result.
The data analysis apparatus according to claim 1,
A data analysis apparatus further comprising a display unit for displaying the mapping result and the analysis result.
The data analysis apparatus according to claim 1,
The analysis unit generates position information in a one-dimensional genome reference coordinate of a lead paired with the sequence data, analyzes the position information in a two-dimensional plane in which the one-dimensional genome reference coordinate is arranged in two dimensions, A data analysis apparatus for analyzing the sample DNA.
A DNA sequencer device for sequencing sample DNA fragments and outputting the sequencing results as sequencing data;
A data acquisition unit for acquiring the sequencing data, and an analysis unit for analyzing the sample DNA by mapping a read sequence obtained from the sequencing data to a two-dimensional genome reference coordinate in which genome reference coordinates are arranged in two dimensions A data analysis device having
DNA analysis system comprising:
The data analysis system according to claim 16, comprising:
The data acquisition unit acquires paired-end sequencing data as the sequencing data,
The data analysis system, wherein the analysis unit arranges the genome reference coordinates of the same type in two dimensions to serve as the two-dimensional genome reference coordinates.
The data analysis system according to claim 16, comprising:
The data analysis system, wherein the analysis unit performs the analysis by clustering the mapped read sequences.
The data analysis system according to claim 16, comprising:
The analysis unit generates position information in a one-dimensional genome reference coordinate of a lead paired with the sequence data, analyzes the position information in a two-dimensional plane in which the one-dimensional genome reference coordinate is arranged in two dimensions, A data analysis system for analyzing the sample DNA.
A computer having a storage unit and a calculation unit
Processing to obtain sequencing data of sample DNA fragments;
And a process of analyzing the sample DNA by mapping a read sequence obtained from the sequencing data to a two-dimensional genome reference coordinate in which genome reference coordinates are arranged two-dimensionally.