CN110942807A

CN110942807A - Method and apparatus for detecting gene rearrangement

Info

Publication number: CN110942807A
Application number: CN201911144339.8A
Authority: CN
Inventors: 陈利斌; 郭璟; 楼峰; 曹善柏
Original assignee: Beijing Xiangxin Medical Technology Co Ltd; Tianjin Xiangxin Biotechnology Co Ltd; Beijing Xiangxin Biotechnology Co Ltd
Current assignee: Beijing Xiangxin Medical Technology Co Ltd; Tianjin Xiangxin Biotechnology Co Ltd; Beijing Xiangxin Biotechnology Co Ltd
Priority date: 2019-11-20
Filing date: 2019-11-20
Publication date: 2020-03-31

Abstract

The invention provides a method and a device for detecting gene rearrangement. The method comprises: obtaining sequencing data of a sample to be detected, a rearranged gene list to be detected and human reference genome transcript information; constructing a first kmers hash table related to the gene list by using the rearranged gene list to be detected and the human reference genome transcript information; constructing a second kmers hash table for each pair of reads in the sequencing data; matching the first kmers hash table against the second kmers hash table to obtain a mapping relation cluster of genes and read pairs; counting the coverage width of the read pairs in the mapping relation cluster, and screening the gene rearrangement gene pair with the largest coverage width as a candidate gene rearrangement gene pair; and filtering the candidate gene rearrangement gene pairs to obtain target rearrangement gene pairs. The method reduces false positives among the detected gene rearrangement events and improves the sensitivity of the detection algorithm.

Description

Method and apparatus for detecting gene rearrangement

Technical Field

The invention relates to the field of gene sequencing data analysis, in particular to a method and a device for detecting gene rearrangement.

Background

At present, methods for detecting rearrangement based on second-generation sequencing technology basically follow the same idea. First, the sequences to be aligned are aligned to the human reference genome to obtain abnormally aligned sequences. Next, breakpoint positions are determined from the alignment position and alignment direction of the abnormally aligned sequences on the reference genome. Finally, the sequences that support a candidate breakpoint are assembled, and only breakpoints consistent with the sequence information at the candidate breakpoint position in the assembly result are retained, thereby determining the gene rearrangement.

However, the existing detection methods are prone to false-positive gene rearrangement results, which affects detection accuracy.

Disclosure of Invention

The main object of the invention is to provide a method and a device for detecting gene rearrangement, so as to solve the problem that gene rearrangement detection in the prior art is prone to false positives.

In order to achieve the above object, according to one aspect of the present invention, there is provided a method for detecting gene rearrangement, the method comprising: obtaining sequencing data of a sample to be detected, a rearranged gene list to be detected and human reference genome transcript information; constructing a first kmers hash table related to the gene list by using the rearranged gene list to be detected and the human reference genome transcript information; constructing a second kmers hash table for each pair of reads in the sequencing data; matching the first kmers hash table against the second kmers hash table to obtain a mapping relation cluster of genes and read pairs; counting the coverage width of the read pairs in the mapping relation cluster, and screening the gene rearrangement gene pair with the largest coverage width as a candidate gene rearrangement gene pair; and filtering the candidate gene rearrangement gene pairs to obtain target rearrangement gene pairs.

Further, before constructing the second kmers hash table for each pair of reads in the sequencing data, the method further comprises: performing complexity screening on each read in the sequencing data, and retaining the reads whose complexity meets a threshold; the threshold is preferably 0.3 or more.

Further, filtering the candidate gene rearrangement gene pairs comprises: filtering the candidate gene rearrangement gene pairs according to at least one of the following conditions: 1) if the two genes of the candidate gene rearrangement gene pair are from the same chromosome, the two genes are at least 100000 bp apart; 2) the number of candidate gene rearrangement gene pairs in which the same gene occurs is less than or equal to 5; 3) the candidate gene rearrangement gene pair is supported by at least 2 read pairs.

Further, the method further comprises annotating the target rearranged gene pair; preferably, the annotation comprises an HGVS annotation and/or a COSMIC annotation.

Further, the method comprises visually displaying the rearrangement event of the target rearranged gene.

In order to achieve the above object, according to one aspect of the present invention, there is provided an apparatus for detecting gene rearrangement, the apparatus comprising an acquisition module, a first construction module, a second construction module, a mapping module, a statistical screening module and a filtering module, wherein the acquisition module is used for acquiring sequencing data of a sample to be detected, a rearranged gene list to be detected and human reference genome transcript information; the first construction module is used for constructing a first kmers hash table related to the gene list by using the rearranged gene list to be detected and the human reference genome transcript information; the second construction module is used for constructing a second kmers hash table for each pair of reads in the sequencing data; the mapping module is used for matching the first kmers hash table against the second kmers hash table to obtain a mapping relation cluster of genes and read pairs; the statistical screening module is used for counting the coverage width of the read pairs in the mapping relation cluster and screening the gene rearrangement gene pair with the largest coverage width as a candidate gene rearrangement gene pair; and the filtering module is used for filtering the candidate gene rearrangement gene pairs to obtain target rearrangement gene pairs.

Further, the apparatus further comprises a complexity screening module for performing complexity screening on each read in the sequencing data and retaining the reads whose complexity meets a threshold; the threshold is preferably 0.3 or more.

Further, the filtering module includes a filtering unit for filtering the candidate gene rearrangement gene pairs according to at least one of the following conditions: 1) if the two genes of the candidate gene rearrangement gene pair are from the same chromosome, the two genes are at least 100000 bp apart; 2) the number of candidate gene rearrangement gene pairs in which the same gene occurs is less than or equal to 5; 3) the candidate gene rearrangement gene pair is supported by at least 2 read pairs.

Further, the apparatus further comprises an annotation module for annotating the target rearranged gene pair; preferably, the annotation comprises an HGVS annotation and/or a COSMIC annotation.

Further, the device comprises a visualization module for visually displaying the rearrangement event of the target rearranged gene.

According to another aspect of the present invention, there is provided a storage medium comprising a stored program, wherein the program, when executed, controls an apparatus on which the storage medium is located to perform any one of the above-described methods for detecting gene rearrangement.

According to another aspect of the present invention, there is provided a processor for running a program, wherein the program when running performs any of the above-described methods for detecting gene rearrangement.

By applying the technical scheme of the present invention, candidate gene rearrangement gene pairs are screened out by a kmers-based method and then filtered to obtain target rearrangement gene pairs, so that the position at which the gene rearrangement event occurs is determined. The method greatly reduces the spurious abnormally aligned read pairs (paired reads) introduced by methods that align reads to the reference genome, thereby reducing false positives among the detected gene rearrangement events. In addition, the method does not assemble reads into a consensus sequence, which improves the sensitivity of the detection algorithm.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:

FIG. 1 is a schematic flow chart diagram showing a method for detecting gene rearrangement in accordance with a preferred embodiment of the present invention;

FIG. 2 is a schematic configuration diagram showing an apparatus for detecting gene rearrangement in accordance with a preferred embodiment of the present invention; and

FIG. 3 is a view showing the visualized results of the apparatus for detecting gene rearrangement in the preferred embodiment according to the present invention.

Detailed Description

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present invention will be described in detail with reference to examples.

As mentioned in the Background section, conventional detection methods are prone to false-positive events that affect the accuracy of the result. To improve on this, the inventors analyzed the cause of the false positives and found that, because of homologous sequences or pseudogenes, sequences to be aligned are aligned to wrong positions, which introduces abnormally aligned sequences and in turn leads to false-positive gene rearrangement events being detected. In addition, the abnormally aligned sequences require assembly, which lowers the sensitivity of the algorithm.

In view of these shortcomings of identifying gene rearrangement from abnormally aligned sequences obtained by aligning reads to the reference genome, the applicant proposes the improved scheme of the present application.

Example 1

In a preferred embodiment of the present application, there is provided a method for detecting gene rearrangement. FIG. 1 is a flowchart of a method for detecting gene rearrangement according to an embodiment of the present invention. As shown in FIG. 1, the method includes:

step S101, obtaining sequencing data of a sample to be detected, a rearranged gene list to be detected and reference genome transcript information;

step S102, constructing a first kmers hash table related to the gene list by using the rearranged gene list to be detected and the human reference genome transcript information;

step S103, constructing a second kmers hash table for each pair of reads in the sequencing data;

step S104, matching the first kmers hash table against the second kmers hash table to obtain a mapping relation cluster of genes and read pairs;

step S105, counting the coverage width of read pairs in the mapping relation cluster, and screening the gene rearrangement gene pair with the largest coverage width as a candidate gene rearrangement gene pair;

and step S106, filtering the candidate gene rearrangement gene pairs to obtain target rearrangement gene pairs.

In the method, candidate gene rearrangement gene pairs are screened out by a kmers-based method and then filtered to obtain target rearrangement gene pairs, so that the position at which the gene rearrangement event occurs is determined. The method greatly reduces the spurious abnormally aligned read pairs (paired reads) introduced by methods that align reads to the reference genome, thereby reducing false positives among the detected gene rearrangement events. In addition, the method does not assemble reads into a consensus sequence, which improves the sensitivity of the detection algorithm.

The rearranged gene list is a list of the names of pairs of rearranged genes. It comprises two columns of information: the first column is the name of the gene at one end of the gene rearrangement, and the second column is the name of the gene at the other end. For example: EML4- - -ALK; TMPRSS2- - -ERG. Constructing a second kmers hash table for each pair of reads in the sequencing data means dividing into kmers the paired reads among the PE reads obtained by paired-end sequencing.

The first kmers hash table and the second kmers hash table are matched by looking for kmers that are exactly identical in the two tables. A hash table consists of keys and values; each key is unique and duplicates are not allowed. An example of the matching rule follows, where the key (similar to an index) is to the left of the colon and the value is to the right. The first hash table is: TCG: 1; ACG: 1; CCG: 1. The second hash table is: ACG: 1. The matching result is then ACG: 1.
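The matching rule can be illustrated with a minimal Python sketch. The function name `match_kmer_tables` and the use of plain dictionaries are assumptions made for illustration; the patent does not prescribe a data structure beyond a key/value hash table.

```python
def match_kmer_tables(first_table, second_table):
    """Keep the entries of first_table whose key (k-mer) also occurs in second_table.

    Both arguments are plain dicts mapping a k-mer string to a value.
    Hypothetical helper for illustration only.
    """
    shared_keys = first_table.keys() & second_table.keys()
    return {kmer: first_table[kmer] for kmer in sorted(shared_keys)}


# Tables taken from the example in the text.
first_table = {"TCG": 1, "ACG": 1, "CCG": 1}
second_table = {"ACG": 1}

matched = match_kmer_tables(first_table, second_table)
print(matched)  # {'ACG': 1}
```

Only exact key equality is used here, which mirrors the requirement that the kmers in the two tables be completely the same.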

The coverage width of the read pairs in the mapping relation cluster is counted, and the gene rearrangement gene pair with the largest coverage width is screened as a candidate gene rearrangement gene pair, where the coverage width refers to the number of genes covered. Reads that cover only one gene are removed. For example, for the read sequence ATCGAGAGCATGA, gene A covers ATCGA and gene B covers GCATGA; in this case 2 genes are covered, indicating a rearrangement between the two genes. If a read covers only one gene, no rearrangement is indicated and the read is removed. A gene rearrangement gene pair refers to the two genes involved in a rearrangement event.

In a preferred embodiment, before constructing the second kmers hash table for each pair of reads in the sequencing data, the method further comprises: performing complexity screening on each read in the sequencing data, and retaining the reads whose complexity meets a threshold; the threshold is preferably 0.3 or more.
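The coverage-width idea can be sketched as follows, using the read from the example above. The names `covered_genes` and `kmer_to_gene`, the choice of k = 5, and the gene labels are assumptions made for illustration; the lookup table here is filled by hand rather than built from transcripts.

```python
def kmers(seq, k):
    """Return all k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def covered_genes(read_seqs, kmer_to_gene, k):
    """Return the set of genes hit by any k-mer of the reads in one read pair."""
    genes = set()
    for seq in read_seqs:
        for kmer in kmers(seq, k):
            gene = kmer_to_gene.get(kmer)
            if gene is not None:
                genes.add(gene)
    return genes


# Hand-built lookup: gene A covers ATCGA, gene B covers GCATGA (5-mers GCATG, CATGA).
k = 5
kmer_to_gene = {"ATCGA": "geneA", "GCATG": "geneB", "CATGA": "geneB"}

covered = covered_genes(["ATCGAGAGCATGA"], kmer_to_gene, k)
coverage_width = len(covered)          # number of genes covered
is_candidate = coverage_width >= 2     # reads covering a single gene are dropped
```

In this sketch a read pair is kept as rearrangement evidence only when its coverage width is at least 2.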

Complexity of a read: the proportion of positions in a read at which a base differs from the next base. For example, if the read sequence is TCGAACGA, the total number of comparisons is the read length minus 1, namely 8 - 1 = 7; the number of positions where a base differs from the next base is 6, so the complexity is 6/7 ≈ 0.857. To improve the accuracy of the second kmers hash table, low-complexity reads are removed first. Lower complexity means more repeated bases in the kmers. If the threshold is 0.3, reads with complexity < 0.3 are removed.

In a preferred embodiment, filtering the candidate gene rearrangement gene pairs comprises: filtering the candidate gene rearrangement gene pairs according to at least one of the following conditions: 1) if the two genes of the candidate gene rearrangement gene pair are from the same chromosome, the two genes are at least 100000 bp apart; 2) the number of candidate gene rearrangement gene pairs in which the same gene occurs is less than or equal to 5; 3) the candidate gene rearrangement gene pair is supported by at least 2 read pairs.
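The complexity definition translates directly into a few lines of Python. This is a minimal sketch: the function names are hypothetical, and returning 0.0 for reads shorter than two bases is an assumption, since the text does not say how such reads are handled.

```python
def read_complexity(seq):
    """Fraction of adjacent base pairs in seq where the base differs from the next one."""
    if len(seq) < 2:
        return 0.0  # assumption: no comparisons possible, treat as lowest complexity
    changes = sum(1 for a, b in zip(seq, seq[1:]) if a != b)
    return changes / (len(seq) - 1)


def passes_complexity(seq, threshold=0.3):
    """Keep a read when its complexity is at least the threshold (0.3 in the text)."""
    return read_complexity(seq) >= threshold


example = read_complexity("TCGAACGA")  # 6 changes over 7 comparisons
print(round(example, 3))  # 0.857
```

A homopolymer run such as AAAAAAAAAT scores 1/9 and would be removed at the 0.3 threshold.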

The principles behind the filtering conditions are as follows: 1) if the two genes constituting the gene rearrangement are from the same chromosome, the two genes must be sufficiently far apart to ensure that the rearrangement is a true positive; the threshold used in the invention is 100000 bp. 2) If the same gene appears frequently among the candidate gene rearrangements, the gene is likely to contain a homologous sequence and false-positive rearrangements may occur; the threshold used in the invention is 5 occurrences. 3) To ensure that the detected rearrangement is a true positive, at least 2 read pairs are required as supporting evidence.

In a preferred embodiment, the above method further comprises annotating the target rearranged gene pair, preferably with an HGVS annotation and/or a COSMIC annotation, in order to support better interpretation in genetic counseling. HGVS: the Human Genome Variation Society (HGVS) rules aim to provide a standardized nomenclature for gene, transcript and protein information, forming an internationally shared convention. COSMIC is an abbreviation of "Catalogue Of Somatic Mutations In Cancer"; it covers the scientific literature and data from large-scale experimental screens of the Cancer Genome Project at the Sanger Institute. The database is intended to collect and display information on somatic mutations in cancer.
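The three filtering conditions can be sketched as a single function. The text allows filtering by at least one of the conditions; this sketch applies all three together, which is an assumption. The function name, the input layout (a dict of read-pair support counts and a dict of gene loci) and the gene names and coordinates in the example are all hypothetical.

```python
from collections import Counter


def filter_candidates(candidates, gene_locus,
                      min_distance=100_000, max_pairs_per_gene=5, min_read_pairs=2):
    """Return the candidate gene pairs that pass all three conditions.

    candidates: dict mapping (gene1, gene2) -> number of supporting read pairs.
    gene_locus: dict mapping gene -> (chromosome, position in bp).
    """
    # Count in how many candidate pairs each gene takes part (condition 2).
    pairs_per_gene = Counter()
    for gene1, gene2 in candidates:
        pairs_per_gene[gene1] += 1
        pairs_per_gene[gene2] += 1

    kept = []
    for (gene1, gene2), support in candidates.items():
        chrom1, pos1 = gene_locus[gene1]
        chrom2, pos2 = gene_locus[gene2]
        # Condition 1: genes on the same chromosome must be at least min_distance apart.
        if chrom1 == chrom2 and abs(pos1 - pos2) < min_distance:
            continue
        # Condition 2: neither gene may occur in more than max_pairs_per_gene pairs.
        if max(pairs_per_gene[gene1], pairs_per_gene[gene2]) > max_pairs_per_gene:
            continue
        # Condition 3: at least min_read_pairs read pairs must support the pair.
        if support < min_read_pairs:
            continue
        kept.append((gene1, gene2))
    return kept


# Made-up genes and coordinates, for illustration only.
gene_locus = {
    "geneA": ("chr1", 1_000_000), "geneB": ("chr1", 4_000_000),
    "geneC": ("chr2", 500), "geneD": ("chr3", 10_000), "geneE": ("chr3", 50_000),
}
candidates = {("geneA", "geneB"): 3, ("geneA", "geneC"): 1, ("geneD", "geneE"): 4}
targets = filter_candidates(candidates, gene_locus)
```

Here the geneA/geneC pair fails on read support and the geneD/geneE pair fails on distance, so only the geneA/geneB pair remains.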

In order to more intuitively display the gene rearrangement event, in a preferred embodiment, the method further comprises visually displaying the rearrangement event of the target rearranged gene. The specific visualization means may be, for example, IGV (a genome visualization software).

Example 2

In a preferred embodiment of the present application, there is provided a more specific method of detecting gene rearrangement, the method comprising:

(1) inputting data:

1) sequencing data of the sample, in fastq format;

2) the rearranged gene list to be detected, here TMPRSS2 and ERG;

3) human reference genome transcript information.

(2) A kmers hash table related to the gene list is constructed from the rearranged gene list to be detected and the human reference genome transcript information, and kmers that map to more than one gene are removed. K-mers are defined as follows:

K-mers are the substrings of k bases into which a sequence is divided; a sequence of length m can generally be divided into m - k + 1 k-mers. All k-mers of ATGCA are as follows:

2-mers: AT, TG, GC and CA;

3-mers: ATG, TGC and GCA;

4-mers: ATGC and TGCA;

5-mers: ATGCA.

This embodiment preferably employs 19-mers.

(3) For each pair of reads of sequencing data, the following processing was performed:

1) The complexity of each read is calculated. Complexity is defined as the proportion of positions in a read at which a base differs from the next base. For example, if the read sequence is TCGAACGA, the total number of comparisons is the read length minus 1, namely 7; the number of positions where a base differs from the next base is 6, so the complexity is 6/7 ≈ 0.857. In this embodiment, the complexity threshold is 0.3, i.e. reads with complexity < 0.3 are removed.

2) A hash table is constructed from the kmers of each pair of reads.

3) The read kmers hash table constructed in the previous step is matched against the kmers hash table constructed from the reference genome to find the mapping between genes and reads, that is, to find the kmers that are identical in the two tables. The kmers hash table constructed from the reference genome carries position information; for example, for the sequence ATGCGATGCAGATC, the 5-mers hash table is ATGCG: 1-5; TGCGA: 2-6; and so on. Because the position of each gene on the reference genome is fixed, each kmer can be assigned to its gene.

4) The coverage width of each read pair is calculated, and read pairs covering only one gene are eliminated, since they do not indicate a gene rearrangement.

5) Screening for gene rearrangement pairs with the largest coverage width.

6) Candidate gene rearrangement gene pairs are obtained.

(4) Candidate gene rearrangement gene pairs were filtered under the following conditions:

1) if the two genes of the candidate gene rearrangement gene pair are from the same chromosome, the distance between the two genes cannot be less than 100000 bp;

2) the same gene cannot appear in too many candidate gene rearrangement gene pairs; the default limit is 5;

3) at least 2 read pairs must support the rearrangement event for the candidate gene rearrangement gene pair.

(5) The resulting rearranged genes are annotated with HGVS and COSMIC for better interpretation in genetic counseling.

(6) The detected gene rearrangement event is visualized.
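Steps (2) and (3) rely on k-mer enumeration and on a reference table that records positions and drops k-mers shared by several genes. The following is a minimal sketch under stated assumptions: one sequence per gene, 1-based inclusive coordinates as in the ATGCG: 1-5 example, and keeping the first occurrence when a k-mer repeats within one gene. The function names and the toy sequence for `geneB` are hypothetical.

```python
def kmers_with_positions(seq, k):
    """Yield (kmer, start, end) using 1-based inclusive coordinates."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k], i + 1, i + k


def build_reference_table(transcripts, k):
    """Map each k-mer to (gene, start, end), removing k-mers seen in more than one gene.

    transcripts: dict mapping gene name -> sequence (assumed layout).
    """
    table = {}
    ambiguous = set()
    for gene, seq in transcripts.items():
        for kmer, start, end in kmers_with_positions(seq, k):
            if kmer in ambiguous:
                continue
            if kmer in table and table[kmer][0] != gene:
                # Same k-mer in two genes: drop it so it cannot map to either.
                del table[kmer]
                ambiguous.add(kmer)
            elif kmer not in table:
                table[kmer] = (gene, start, end)
    return table


# k-mers of ATGCA, as listed in the text: a length-m sequence yields m - k + 1 k-mers.
three_mers = [kmer for kmer, _, _ in kmers_with_positions("ATGCA", 3)]

# Position example from the text, plus a toy second gene sharing the 5-mer ATGCG.
transcripts = {"geneA": "ATGCGATGCAGATC", "geneB": "TTTTTATGCG"}
reference_table = build_reference_table(transcripts, 5)
```

With these inputs, ATGCG is removed because it occurs in both genes, while TGCGA is kept with positions 2-6 in `geneA`.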

Example 3

This example utilizes simulated data of TMPRSS2-ERG, standard sample data, and true positive sample data for testing.

Simulation data:

Table 1:

Sequencing length (bp)	Sequencing depth	TMPRSS2-ERG
2 × 50	20x	Detected
2 × 50	50x	Detected
2 × 50	100x	Detected
2 × 75	20x	Detected
2 × 75	50x	Detected
2 × 75	100x	Detected
2 × 150	20x	Detected
2 × 150	50x	Detected
2 × 150	100x	Detected

Annotation results:

1) ENST00000332149.5(TMPRSS2), r.1_79_ ENST00000442448.1(ERG), r.312_5034 (HGVS annotation result);

2) COSMIC ID (i.e., the ID annotated in the COSMIC database): COSF 25.

The visualization results are shown in fig. 3.

Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

Corresponding to the above method, the present application also provides a device for detecting gene rearrangement, which is used to implement the above embodiments and preferred embodiments; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the device described in the embodiments below is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

This is further illustrated below in connection with alternative embodiments.

Example 4

In this embodiment, there is also provided an apparatus for detecting gene rearrangement, the apparatus comprising an acquisition module 10, a first construction module 20, a second construction module 30, a mapping module 40, a statistical screening module 50 and a filtering module 60, wherein the acquisition module is used for acquiring sequencing data of a sample to be detected, a rearranged gene list to be detected and human reference genome transcript information; the first construction module is used for constructing a first kmers hash table related to the gene list by using the rearranged gene list to be detected and the human reference genome transcript information; the second construction module is used for constructing a second kmers hash table for each pair of reads in the sequencing data; the mapping module is used for matching the first kmers hash table against the second kmers hash table to obtain a mapping relation cluster of genes and read pairs; the statistical screening module is used for counting the coverage width of the read pairs in the mapping relation cluster and screening the gene rearrangement gene pair with the largest coverage width as a candidate gene rearrangement gene pair; and the filtering module is used for filtering the candidate gene rearrangement gene pairs to obtain target rearrangement gene pairs.

In a preferred embodiment, the above apparatus further comprises a complexity screening module for performing complexity screening on each read in the sequencing data and retaining the reads whose complexity meets a threshold; the threshold is preferably 0.3 or more.

In a preferred embodiment, the filtering module comprises a filtering unit for filtering the candidate gene rearrangement gene pairs according to at least one of the following conditions: 1) if the two genes of the candidate gene rearrangement gene pair are from the same chromosome, the two genes are at least 100000 bp apart; 2) the number of candidate gene rearrangement gene pairs in which the same gene occurs is less than or equal to 5; 3) the candidate gene rearrangement gene pair is supported by at least 2 read pairs.

In a preferred embodiment, the apparatus further comprises an annotation module for annotating the target rearranged gene pair; preferably, the annotation comprises an HGVS annotation and/or a COSMIC annotation.

In a preferred embodiment, the apparatus further comprises a visualization module for visually displaying the rearrangement event of the target rearranged gene.

From the above description, it can be seen that the above-described embodiments of the present invention achieve the following technical effects:

1) Being based on kmers, the invention greatly reduces the spurious abnormally aligned paired reads introduced by methods that align reads to the reference genome, thereby reducing false positives among the detected gene rearrangement events.

2) The invention can carry out HGVS annotation on the detected gene rearrangement event and can also annotate COSMIC ID.

3) The invention allows visualization of detected gene rearrangement events.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method of detecting gene rearrangement, the method comprising:

obtaining sequencing data of a sample to be detected, a rearranged gene list to be detected and human reference genome transcript information;

constructing a first kmers hash table related to the gene list by using the rearranged gene list to be detected and the human reference genome transcript information;

constructing a second kmers hash table for each pair of reads in the sequencing data;

matching the first kmers hash table against the second kmers hash table to obtain a mapping relation cluster of genes and read pairs;

counting the coverage width of the read pairs in the mapping relation cluster, and screening the gene rearrangement gene pair with the largest coverage width as a candidate gene rearrangement gene pair;

and filtering the candidate gene rearrangement gene pair to obtain a target rearrangement gene pair.

2. The method of claim 1, wherein prior to constructing a second kmers hash table for each pair of reads in the sequencing data, the method further comprises: performing complexity screening on each read in the sequencing data, and retaining the reads whose complexity meets a threshold; preferably, the threshold is 0.3 or more.

3. The method of claim 1, wherein filtering the candidate gene rearrangement gene pairs comprises: filtering said candidate gene rearrangement gene pairs according to at least one of the following conditions:

1) if the two genes of the candidate gene rearrangement gene pair are from the same chromosome, the two genes are separated by at least 100000 bp;

2) the number of candidate gene rearrangement gene pairs in which the same gene occurs is not more than 5;

3) the candidate gene rearrangement gene pair is supported by at least 2 read pairs.

4. The method according to any one of claims 1 to 3, wherein the method further comprises annotating the target rearranged gene pair, preferably wherein the annotation comprises an HGVS annotation and/or a COSMIC annotation.

5. The method according to any one of claims 1 to 3, wherein the method further comprises visually displaying the rearrangement event of the target rearranged gene.

6. An apparatus for detecting gene rearrangement, the apparatus comprising:

the acquisition module is used for acquiring sequencing data of a sample to be detected, a rearranged gene list to be detected and human reference genome transcript information;

the first construction module is used for constructing a first kmers hash table related to the gene list by using the rearranged gene list to be detected and the information of the human reference genome transcript;

a second construction module to construct a second kmers hash table for each pair of reads in the sequencing data;

the mapping module is used for matching the first kmers hash table against the second kmers hash table to obtain a mapping relation cluster of genes and read pairs;

a statistical screening module, configured to count the coverage widths of the read pairs in the mapping relationship cluster, and screen the gene rearrangement gene pair with the largest coverage width as a candidate gene rearrangement gene pair;

and the filtering module is used for filtering the candidate gene rearrangement gene pairs to obtain target rearrangement gene pairs.

7. The apparatus of claim 6, further comprising: a complexity screening module for performing complexity screening on each read in the sequencing data and retaining the reads whose complexity meets a threshold; preferably, the threshold is 0.3 or more.

8. The apparatus of claim 6, wherein the filtering module comprises: a filtering unit that filters the candidate gene rearrangement gene pair according to at least one of the following conditions:

1) if the two genes of the candidate gene rearrangement gene pair are from the same chromosome, the two genes are separated by at least 100000 bp;

2) the number of candidate gene rearrangement gene pairs in which the same gene occurs is not more than 5;

3) the candidate gene rearrangement gene pair is supported by at least 2 read pairs.

9. The apparatus according to any one of claims 6 to 8, further comprising an annotation module for annotating said pair of target rearranged genes, preferably said annotation comprises an HGVS annotation and/or a COSMIC annotation.

10. The apparatus according to any one of claims 6 to 8, further comprising a visualization module for visually displaying the rearrangement event of the target rearranged gene.

11. A storage medium comprising a stored program, wherein the program is executed to control a device on which the storage medium is located to perform the method for detecting gene rearrangement of any one of claims 1 to 5.

12. A processor configured to run a program, wherein the program when executed performs the method of detecting gene rearrangement of any one of claims 1 to 5.