CN110942807A - Method and apparatus for detecting gene rearrangement - Google Patents

Publication number
CN110942807A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911144339.8A
Other languages
Chinese (zh)
Inventor
陈利斌
郭璟
楼峰
曹善柏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiangxin Medical Technology Co Ltd
Tianjin Xiangxin Biotechnology Co Ltd
Beijing Xiangxin Biotechnology Co Ltd
Original Assignee
Beijing Xiangxin Medical Technology Co Ltd
Tianjin Xiangxin Biotechnology Co Ltd
Beijing Xiangxin Biotechnology Co Ltd
Application filed by Beijing Xiangxin Medical Technology Co Ltd, Tianjin Xiangxin Biotechnology Co Ltd, Beijing Xiangxin Biotechnology Co Ltd
Priority to CN201911144339.8A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

The invention provides a method and an apparatus for detecting gene rearrangement. The method comprises: obtaining sequencing data of a sample to be detected, a rearranged gene list to be detected, and human reference genome transcript information; constructing a first kmers hash table for the gene list from the rearranged gene list to be detected and the human reference genome transcript information; constructing a second kmers hash table for each pair of reads in the sequencing data; matching the first kmers hash table against the second kmers hash table to obtain a mapping relation cluster of genes and read pairs; counting the coverage width of the read pairs in the mapping relation cluster, and selecting the gene rearrangement gene pair with the largest coverage width as a candidate gene rearrangement gene pair; and filtering the candidate gene rearrangement gene pairs to obtain target rearranged gene pairs. The method reduces false positives among detected gene rearrangement events and improves the sensitivity of the detection algorithm.

Description

Method and apparatus for detecting gene rearrangement
Technical Field
The invention relates to the field of gene sequencing data analysis, in particular to a method and a device for detecting gene rearrangement.
Background
At present, methods for detecting rearrangements based on second-generation sequencing generally follow the same approach. First, the sequencing reads are aligned to the human reference genome to obtain abnormally aligned sequences. Breakpoint positions are then determined from the alignment positions and orientations of the abnormally aligned sequences on the reference genome. Finally, the reads that support a candidate breakpoint are assembled, and the breakpoints that are consistent with the assembly result are retained, thereby determining the gene rearrangement.
However, existing detection methods are prone to false positive gene rearrangement results, which reduces detection accuracy.
Disclosure of Invention
The main object of the invention is to provide a method and an apparatus for detecting gene rearrangement, so as to solve the problem that gene rearrangement detection in the prior art is prone to false positives.
In order to achieve the above object, according to one aspect of the present invention, there is provided a method for detecting gene rearrangement, the method comprising: obtaining sequencing data of a sample to be detected, a rearranged gene list to be detected, and human reference genome transcript information; constructing a first kmers hash table for the gene list from the rearranged gene list to be detected and the human reference genome transcript information; constructing a second kmers hash table for each pair of reads in the sequencing data; matching the first kmers hash table against the second kmers hash table to obtain a mapping relation cluster of genes and read pairs; counting the coverage width of the read pairs in the mapping relation cluster, and selecting the gene rearrangement gene pair with the largest coverage width as a candidate gene rearrangement gene pair; and filtering the candidate gene rearrangement gene pairs to obtain target rearranged gene pairs.
Further, before constructing the second kmers hash table for each pair of reads in the sequencing data, the method further comprises: performing complexity screening on each read in the sequencing data, and retaining the reads whose complexity meets a threshold; the threshold is preferably 0.3 or more.
Further, filtering the candidate gene rearrangement gene pairs comprises filtering them according to at least one of the following conditions: 1) if the two genes of the candidate gene rearrangement gene pair are on the same chromosome, the two genes are at least 100000 bp apart; 2) the same gene occurs in no more than 5 candidate gene rearrangement gene pairs; 3) the candidate gene rearrangement gene pair is supported by at least 2 read pairs.
Further, the method further comprises annotating the target rearranged gene pair; preferably, the annotation comprises an HGVS annotation and/or a COSMIC annotation.
Further, the method comprises visually displaying the rearrangement event of the target rearranged gene.
In order to achieve the above object, according to another aspect of the present invention, there is provided an apparatus for detecting gene rearrangement, the apparatus comprising an acquisition module, a first construction module, a second construction module, a mapping module, a statistical screening module and a filtering module, wherein: the acquisition module is used for obtaining sequencing data of a sample to be detected, a rearranged gene list to be detected, and human reference genome transcript information; the first construction module is used for constructing a first kmers hash table for the gene list from the rearranged gene list to be detected and the human reference genome transcript information; the second construction module is used for constructing a second kmers hash table for each pair of reads in the sequencing data; the mapping module is used for matching the first kmers hash table against the second kmers hash table to obtain a mapping relation cluster of genes and read pairs; the statistical screening module is used for counting the coverage width of the read pairs in the mapping relation cluster and selecting the gene rearrangement gene pair with the largest coverage width as a candidate gene rearrangement gene pair; and the filtering module is used for filtering the candidate gene rearrangement gene pairs to obtain target rearranged gene pairs.
Further, the apparatus further comprises a complexity screening module, which is used for performing complexity screening on each read in the sequencing data and retaining the reads whose complexity meets a threshold; the threshold is preferably 0.3 or more.
Further, the filtering module includes a filtering unit for filtering the candidate gene rearrangement gene pairs according to at least one of the following conditions: 1) if the two genes of the candidate gene rearrangement gene pair are on the same chromosome, the two genes are at least 100000 bp apart; 2) the same gene occurs in no more than 5 candidate gene rearrangement gene pairs; 3) the candidate gene rearrangement gene pair is supported by at least 2 read pairs.
Further, the apparatus further comprises an annotation module for annotating the target rearranged gene pair; preferably, the annotation comprises an HGVS annotation and/or a COSMIC annotation.
Further, the device comprises a visualization module for visually displaying the rearrangement event of the target rearranged gene.
According to another aspect of the present invention, there is provided a storage medium comprising a stored program, wherein the program, when executed, controls an apparatus on which the storage medium is located to perform any one of the above-described methods for detecting gene rearrangement.
According to another aspect of the present invention, there is provided a processor for running a program, wherein the program when running performs any of the above-described methods for detecting gene rearrangement.
By applying the technical scheme of the invention, candidate gene rearrangement gene pairs are screened out by a kmers-based method and then filtered to obtain target rearranged gene pairs, so that the position at which the gene rearrangement event occurs is determined. The method greatly reduces the spurious abnormally aligned read pairs (paired reads) introduced by methods that align reads to the reference genome, thereby reducing false positives among detected gene rearrangement events. In addition, the method does not need to assemble reads into a consensus sequence, which improves the sensitivity of the detection algorithm.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a schematic flow chart diagram showing a method for detecting gene rearrangement in accordance with a preferred embodiment of the present invention;
FIG. 2 is a schematic configuration diagram showing an apparatus for detecting gene rearrangement in accordance with a preferred embodiment of the present invention; and
FIG. 3 is a view showing the visualized results of the apparatus for detecting gene rearrangement in the preferred embodiment according to the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present invention will be described in detail with reference to examples.
As mentioned in the background section, conventional detection methods are prone to false positive events, which affect the accuracy of the result. To improve this situation, the inventors analyzed the cause of the false positive events and found that, because of homologous sequences or pseudogenes, the sequences to be aligned are aligned to wrong positions; this introduces abnormally aligned sequences, which in turn lead to false positive gene rearrangement events being detected. In addition, the abnormally aligned sequences require assembly, which lowers the sensitivity of the algorithm.
In view of these shortcomings of identifying gene rearrangements from the abnormally aligned sequences obtained by aligning reads to the reference genome, the applicant proposes the improved scheme of the present application.
Example 1
In a preferred embodiment of the present application, there is provided a method for detecting gene rearrangement, and FIG. 1 is a flowchart of a method for detecting gene rearrangement according to an embodiment of the present invention. As shown in FIG. 1, the method includes:
step S101, obtaining sequencing data of a sample to be detected, a rearranged gene list to be detected, and human reference genome transcript information;
step S102, constructing a first kmers hash table for the gene list from the rearranged gene list to be detected and the human reference genome transcript information;
step S103, constructing a second kmers hash table for each pair of reads in the sequencing data;
step S104, matching the first kmers hash table against the second kmers hash table to obtain a mapping relation cluster of genes and read pairs;
step S105, counting the coverage width of the read pairs in the mapping relation cluster, and selecting the gene rearrangement gene pair with the largest coverage width as a candidate gene rearrangement gene pair;
and step S106, filtering the candidate gene rearrangement gene pairs to obtain target rearranged gene pairs.
In the above method, candidate gene rearrangement gene pairs are screened out by a kmers-based method and then filtered to obtain target rearranged gene pairs, so that the position at which the gene rearrangement event occurs is determined. The method greatly reduces the spurious abnormally aligned read pairs (paired reads) introduced by methods that align reads to the reference genome, thereby reducing false positives among detected gene rearrangement events. In addition, the method does not need to assemble reads into a consensus sequence, which improves the sensitivity of the detection algorithm.
The rearranged gene list is a list of the names of the two genes involved in each rearrangement. It contains two columns: the first column is the name of the gene at one end of the gene rearrangement, and the second column is the name of the gene at the other end, for example: EML4- - -ALK; TMPRSS2- - -ERG. Constructing a second kmers hash table for each pair of reads in the sequencing data means dividing the paired reads among the PE reads obtained by paired-end sequencing into kmers.
Matching the first kmers hash table against the second kmers hash table means finding the kmers that are exactly identical in the two hash tables. A hash table consists of keys and values; each key is unique and duplicates are not allowed. An example of the matching rule is as follows, where the part to the left of the colon is the key (similar to an index) and the part to the right is the value. If the first hash table is TCG: 1; ACG: 1; CCG: 1 and the second hash table is ACG: 1, then the matching result is ACG: 1.
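The matching rule in the example above can be sketched in a few lines of Python. This is only an illustration that assumes both hash tables are plain dictionaries keyed by the kmer string; the function name match_kmers is hypothetical and does not come from the patent.

```python
def match_kmers(gene_table, read_table):
    """Keep only the kmers that occur as keys in both hash tables."""
    return {kmer: gene_table[kmer] for kmer in read_table if kmer in gene_table}


# The two tables from the example in the text.
first_table = {"TCG": 1, "ACG": 1, "CCG": 1}   # built from the gene list
second_table = {"ACG": 1}                      # built from one read pair

matched = match_kmers(first_table, second_table)
print(matched)
```

Iterating over the (usually much smaller) read table and probing the gene table keeps each lookup at constant expected cost, which is the usual reason for choosing a hash table here.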
The coverage width of the read pairs in the mapping relation cluster is counted, and the gene rearrangement gene pair with the largest coverage width is selected as a candidate gene rearrangement gene pair. Here the coverage width is the number of genes covered. Cases in which only one gene is covered are removed. For example, for a read with the sequence ATCGAGAGCATGA, gene A covers ATCGA and gene B covers GCATGA; the read therefore covers 2 genes, which indicates a rearrangement between the two genes. If a read covers only one gene, no rearrangement is indicated and the read is removed. A gene rearrangement gene pair refers to the two genes involved in a rearrangement event.
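The coverage-width example above can be reproduced with a small Python sketch. It assumes that the gene table maps each kmer to a single gene name and uses k = 5 so that the short sequences from the text fit; the helper names (kmers, covered_genes) and the second test read are hypothetical and only serve the illustration.

```python
def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def covered_genes(read_pair, gene_index, k):
    """Set of genes hit by at least one kmer of any read in the pair."""
    genes = set()
    for read in read_pair:
        for kmer in kmers(read, k):
            gene = gene_index.get(kmer)
            if gene is not None:
                genes.add(gene)
    return genes


K = 5
# Gene sequences from the example: gene A covers ATCGA, gene B covers GCATGA.
gene_index = {}
for gene, seq in {"A": "ATCGA", "B": "GCATGA"}.items():
    for kmer in kmers(seq, K):
        gene_index[kmer] = gene

spanning = covered_genes(("ATCGAGAGCATGA",), gene_index, K)
single = covered_genes(("ATCGATTTT",), gene_index, K)  # made-up read hitting gene A only

print(len(spanning))  # coverage width 2: kept as a rearrangement candidate
print(len(single))    # coverage width 1: removed
```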
In a preferred embodiment, before constructing the second kmers hash table for each pair of reads in the sequencing data, the method further comprises: performing complexity screening on each read in the sequencing data, and retaining the reads whose complexity meets a threshold; the threshold is preferably 0.3 or more.
Complexity of a read: the proportion of bases in a read that differ from the next base. For example, if the sequence of the read is TCGAACGA, the total number of comparisons is the read length minus 1, i.e., 8 - 1 = 7; the number of bases that differ from the next base is 6, so the complexity is 6/7 ≈ 0.857. To improve the accuracy of the second kmers hash table, low-complexity reads are removed first. Lower complexity means more repeated bases in the kmers. If the threshold is 0.3, reads with complexity < 0.3 are removed.
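The complexity definition above translates directly into code. The following Python sketch follows the worked example in the text; the function names are hypothetical, and returning 0.0 for reads shorter than two bases is an assumption, since the text does not cover that case.

```python
def read_complexity(seq):
    """Fraction of positions whose base differs from the next base."""
    if len(seq) < 2:
        return 0.0  # assumption: too short to compare, treat as low complexity
    differing = sum(1 for a, b in zip(seq, seq[1:]) if a != b)
    return differing / (len(seq) - 1)


def passes_complexity(seq, threshold=0.3):
    """Reads with complexity below the threshold are removed."""
    return read_complexity(seq) >= threshold


print(round(read_complexity("TCGAACGA"), 3))  # 6 of 7 comparisons differ
print(passes_complexity("AAAAAAAAAA"))        # homopolymer run is filtered out
```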
In a preferred embodiment, filtering the candidate gene rearrangement gene pairs comprises filtering them according to at least one of the following conditions: 1) if the two genes of the candidate gene rearrangement gene pair are on the same chromosome, the two genes are at least 100000 bp apart; 2) the same gene occurs in no more than 5 candidate gene rearrangement gene pairs; 3) the candidate gene rearrangement gene pair is supported by at least 2 read pairs.
The filtering conditions are set for the following reasons: 1) if the two genes constituting a gene rearrangement are on the same chromosome, the distance between them must be sufficiently large to ensure that the rearrangement is a true positive; the threshold used in the invention is 100000 bp. 2) If the same gene appears frequently among the candidate gene rearrangements, the gene is likely to contain homologous sequences, and false positive rearrangements may result; the threshold used in the invention is 5 occurrences. 3) To ensure that detected rearrangements are true positives, at least 2 read pairs are required as supporting evidence.
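A minimal Python sketch of the three filtering conditions is given below. It applies all three conditions together, although the text allows any subset; the Candidate record, its field names, the use of a single coordinate per gene to measure distance, and the sample values are all assumptions made for illustration, not data from the patent.

```python
from collections import Counter
from typing import NamedTuple


class Candidate(NamedTuple):
    gene_a: str
    chrom_a: str
    pos_a: int       # assumed representative coordinate of gene_a
    gene_b: str
    chrom_b: str
    pos_b: int       # assumed representative coordinate of gene_b
    read_pairs: int  # number of supporting read pairs


def filter_candidates(cands, min_distance=100_000, max_pairs_per_gene=5, min_support=2):
    # Condition 2 needs to know in how many candidate pairs each gene occurs.
    occurrences = Counter()
    for c in cands:
        occurrences[c.gene_a] += 1
        occurrences[c.gene_b] += 1

    kept = []
    for c in cands:
        if c.chrom_a == c.chrom_b and abs(c.pos_a - c.pos_b) < min_distance:
            continue  # condition 1: same chromosome and too close together
        if max(occurrences[c.gene_a], occurrences[c.gene_b]) > max_pairs_per_gene:
            continue  # condition 2: gene occurs in too many candidate pairs
        if c.read_pairs < min_support:
            continue  # condition 3: not enough supporting read pairs
        kept.append(c)
    return kept


# Made-up candidates with placeholder gene names and coordinates.
candidates = [
    Candidate("G1", "chr1", 1_000_000, "G2", "chr1", 4_000_000, 3),
    Candidate("G3", "chr2", 500_000, "G4", "chr2", 550_000, 10),
    Candidate("G5", "chr3", 100, "G6", "chr4", 200, 1),
]
targets = filter_candidates(candidates)
print([(c.gene_a, c.gene_b) for c in targets])
```

Only the first candidate survives: the second pair is closer than 100000 bp on the same chromosome, and the third has a single supporting read pair.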
In a preferred embodiment, the above method further comprises annotating the target rearranged gene pair, preferably with an HGVS annotation and/or a COSMIC annotation, in order to support better interpretation in genetic counseling. HGVS: the rules of the Human Genome Variation Society (HGVS) provide a standardized nomenclature for gene, transcript and protein information and form an internationally accepted standard. COSMIC is an abbreviation for the Catalogue Of Somatic Mutations In Cancer, which draws on the scientific literature and on large-scale experimental screens from the Cancer Genome Project at the Sanger Institute. The database is intended to collect and display information on somatic mutations in cancer.
In order to display the gene rearrangement event more intuitively, in a preferred embodiment, the method further comprises visually displaying the rearrangement event of the target rearranged gene. The visualization tool may be, for example, IGV (Integrative Genomics Viewer, a genome visualization tool).
Example 2
In a preferred embodiment of the present application, there is provided a more specific method of detecting gene rearrangement, the method comprising:
(1) inputting data:
1) sequencing data of the sample in fastq format, obtained by sequencing the sample;
2) the rearranged gene list to be detected: TMPRSS2 and ERG;
3) human reference genome transcript information.
(2) A kmers hash table for the gene list is constructed from the rearranged gene list to be detected and the human reference genome transcript information, and kmers that map to more than one gene are removed. The definition of k-mers is as follows:
k-mers are the substrings of k bases into which a sequence is divided; in general, a sequence of length m can be divided into m - k + 1 k-mers. All k-mers of ATGCA are as follows:
2-mers: AT, TG, GC and CA;
3-mers: ATG, TGC and GCA;
4-mers: ATGC, TGCA;
5-mers: ATGCA.
This embodiment preferably employs 19-mers.
(3) For each pair of reads of sequencing data, the following processing was performed:
1) The complexity of each read is calculated. Complexity is defined as the proportion of bases in a read that differ from the next base. For example, if the sequence of the read is TCGAACGA, the total number of comparisons is the read length minus 1, i.e., 7; the number of bases that differ from the next base is 6, so the complexity is 6/7 ≈ 0.857. In this embodiment, the complexity threshold is 0.3, i.e., reads with complexity < 0.3 are removed.
2) A kmers hash table is constructed for each pair of reads.
3) The read kmers hash table constructed in the previous step is matched against the kmers hash table constructed from the reference genome to find the mapping relation between genes and reads, i.e., the kmers that are identical in both tables. The kmers hash table constructed from the reference genome carries position information. For example, for the sequence ATGCGATGCAGATC, the 5-mers hash table is ATGCG: 1-5; TGCGA: 2-6; and so on. Because the position of each gene on the reference genome is fixed, each kmer can be assigned to a gene.
4) The coverage width of each read pair is calculated, and cases in which only one gene is covered are removed, since they cannot indicate a gene rearrangement.
5) The gene rearrangement gene pairs with the largest coverage width are selected.
6) Candidate gene rearrangement gene pairs are obtained.
(4) Candidate gene rearrangement gene pairs were filtered under the following conditions:
1) if the two genes of the candidate gene rearrangement gene pair are on the same chromosome, the distance between the two genes must not be less than 100000 bp;
2) the same gene must not appear in too many candidate gene rearrangement gene pairs; the default limit is 5;
3) the candidate gene rearrangement gene pair must be supported by at least 2 read pairs.
(5) The resulting rearranged genes were annotated with HGVS and COSMIC to support better interpretation in genetic counseling.
(6) The detected gene rearrangement event is visualized.
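Step (2) and the position example in step (3) can be sketched together in Python: enumerating the m - k + 1 k-mers of a sequence with 1-based positions, and building a gene index that drops every k-mer found in more than one gene. The function names and the second transcript (geneY) are hypothetical, positions are taken relative to each transcript as an assumption, and k = 5 is used only so that the example sequence from the text fits (the embodiment itself prefers 19-mers).

```python
def kmers_with_positions(seq, k):
    """Yield (kmer, start, end) with 1-based inclusive coordinates."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k], i + 1, i + k


def build_gene_index(transcripts, k):
    """Map kmer -> (gene, start, end); kmers shared by several genes are dropped."""
    index = {}
    ambiguous = set()
    for gene, seq in transcripts.items():
        for kmer, start, end in kmers_with_positions(seq, k):
            if kmer in ambiguous:
                continue
            hit = index.get(kmer)
            if hit is None:
                index[kmer] = (gene, start, end)
            elif hit[0] != gene:
                # Same kmer in a second gene: remove it from the index for good.
                del index[kmer]
                ambiguous.add(kmer)
    return index


example = "ATGCGATGCAGATC"                      # sequence from the text, m = 14
positions = list(kmers_with_positions(example, 5))
print(len(positions))                           # m - k + 1 = 10
print(positions[:2])                            # ATGCG: 1-5, TGCGA: 2-6

# geneY is a made-up transcript that shares the 5-mer ATGCG with geneX.
index = build_gene_index({"geneX": example, "geneY": "TTTTTATGCGCCCC"}, 5)
print("ATGCG" in index)                         # False: maps to both genes
print(index["TGCGA"])                           # unique to geneX
```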
Example 3
This example was tested using simulated TMPRSS2-ERG data, standard sample data, and true positive sample data.
Simulation data:
Table 1:
Sequencing length (bp)    Sequencing depth    TMPRSS2-ERG
2 × 50                    20x                 Detected
2 × 50                    50x                 Detected
2 × 50                    100x                Detected
2 × 75                    20x                 Detected
2 × 75                    50x                 Detected
2 × 75                    100x                Detected
2 × 150                   20x                 Detected
2 × 150                   50x                 Detected
2 × 150                   100x                Detected
Annotation results:
1) ENST00000332149.5(TMPRSS2), r.1_79_ ENST00000442448.1(ERG), r.312_5034 (annotated HGVS results);
2) COSMIC ID (i.e., the ID annotated in the COSMIC database): COSF 25.
The visualization results are shown in fig. 3.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Corresponding to the above method, the present application also provides an apparatus for detecting gene rearrangement, which is used to implement the above embodiments and preferred embodiments; descriptions that have already been given are not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the embodiments below is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
This is further illustrated below in connection with alternative embodiments.
Example 4
In this embodiment, there is also provided an apparatus for detecting gene rearrangement, as shown in FIG. 2, the apparatus comprising an acquisition module 10, a first construction module 20, a second construction module 30, a mapping module 40, a statistical screening module 50 and a filtering module 60, wherein: the acquisition module is used for obtaining sequencing data of a sample to be detected, a rearranged gene list to be detected, and human reference genome transcript information; the first construction module is used for constructing a first kmers hash table for the gene list from the rearranged gene list to be detected and the human reference genome transcript information; the second construction module is used for constructing a second kmers hash table for each pair of reads in the sequencing data; the mapping module is used for matching the first kmers hash table against the second kmers hash table to obtain a mapping relation cluster of genes and read pairs; the statistical screening module is used for counting the coverage width of the read pairs in the mapping relation cluster and selecting the gene rearrangement gene pair with the largest coverage width as a candidate gene rearrangement gene pair; and the filtering module is used for filtering the candidate gene rearrangement gene pairs to obtain target rearranged gene pairs.
In a preferred embodiment, the above apparatus further comprises a complexity screening module, which is used for performing complexity screening on each read in the sequencing data and retaining the reads whose complexity meets a threshold; the threshold is preferably 0.3 or more.
In a preferred embodiment, the filtering module comprises a filtering unit for filtering the candidate gene rearrangement gene pairs according to at least one of the following conditions: 1) if the two genes of the candidate gene rearrangement gene pair are on the same chromosome, the two genes are at least 100000 bp apart; 2) the same gene occurs in no more than 5 candidate gene rearrangement gene pairs; 3) the candidate gene rearrangement gene pair is supported by at least 2 read pairs.
In a preferred embodiment, the apparatus further comprises an annotation module for annotating the target rearranged gene pair; preferably, the annotation comprises an HGVS annotation and/or a COSMIC annotation.
In a preferred embodiment, the apparatus further comprises a visualization module for visually displaying the rearrangement event of the target rearranged gene.
From the above description, it can be seen that the above-described embodiments of the present invention achieve the following technical effects:
1) Because it is based on kmers, the invention greatly reduces the spurious abnormally aligned read pairs introduced by methods that align reads to the reference genome, thereby reducing false positives among detected gene rearrangement events.
2) The invention can carry out HGVS annotation on the detected gene rearrangement event and can also annotate COSMIC ID.
3) The invention allows visualization of detected gene rearrangement events.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (12)

1. A method of detecting gene rearrangement, the method comprising:
obtaining sequencing data of a sample to be detected, a rearranged gene list to be detected, and human reference genome transcript information;
constructing a first kmers hash table for the gene list from the rearranged gene list to be detected and the human reference genome transcript information;
constructing a second kmers hash table for each pair of reads in the sequencing data;
matching the first kmers hash table against the second kmers hash table to obtain a mapping relation cluster of genes and read pairs;
counting the coverage width of the read pairs in the mapping relation cluster, and screening the gene rearrangement gene pair with the largest coverage width as a candidate gene rearrangement gene pair;
and filtering the candidate gene rearrangement gene pair to obtain a target rearrangement gene pair.
2. The method of claim 1, wherein prior to constructing a second kmers hash table for each pair of reads in the sequencing data, the method further comprises: performing complexity screening on each read in the sequencing data, and retaining the reads whose complexity meets a threshold; preferably, the threshold is 0.3 or more.
3. The method of claim 1, wherein filtering the candidate gene rearrangement gene pairs comprises: filtering said candidate gene rearrangement gene pairs according to at least one of the following conditions:
1) if the two genes of the candidate gene rearrangement gene pair are from the same chromosome, the two genes are separated by at least 100000 bp;
2) the same gene occurs in no more than 5 of the candidate gene rearrangement gene pairs;
3) the candidate gene rearrangement gene pair is supported by at least 2 read pairs.
4. The method according to any one of claims 1 to 3, wherein the method further comprises annotating the target rearranged gene pair, preferably wherein the annotation comprises an HGVS annotation and/or a COSMIC annotation.
5. The method according to any one of claims 1 to 3, wherein the method further comprises visually displaying the rearrangement event of the target rearranged gene.
6. An apparatus for detecting gene rearrangement, the apparatus comprising:
the acquisition module is used for obtaining sequencing data of a sample to be detected, a rearranged gene list to be detected, and human reference genome transcript information;
the first construction module is used for constructing a first kmers hash table related to the gene list by using the rearranged gene list to be detected and the information of the human reference genome transcript;
a second construction module to construct a second kmers hash table for each pair of reads in the sequencing data;
the mapping module is used for matching the first kmers hash table against the second kmers hash table to obtain a mapping relation cluster of genes and read pairs;
a statistical screening module, configured to count the coverage widths of the read pairs in the mapping relationship cluster, and screen the gene rearrangement gene pair with the largest coverage width as a candidate gene rearrangement gene pair;
and the filtering module is used for filtering the candidate gene rearrangement gene pairs to obtain target rearrangement gene pairs.
7. The apparatus of claim 6, further comprising: a complexity screening module for performing complexity screening on each read in the sequencing data and retaining reads whose complexity meets a threshold; preferably, the threshold is 0.3 or more.
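The complexity screening of claim 7 can be sketched as follows. The claims shown here do not define how complexity is computed; the sketch assumes a commonly used measure, the fraction of bases that differ from the base immediately after them, and the names `complexity` and `keep_read` are hypothetical.

```python
MIN_COMPLEXITY = 0.3  # preferred lower bound on the threshold per claim 7


def complexity(read):
    """Fraction of adjacent base pairs that differ (assumed definition)."""
    if len(read) < 2:
        return 0.0
    changes = sum(1 for a, b in zip(read, read[1:]) if a != b)
    return changes / (len(read) - 1)


def keep_read(read, threshold=MIN_COMPLEXITY):
    """Retain a read only if its complexity meets the threshold."""
    return complexity(read) >= threshold
```

Under this measure a homopolymer run scores 0 and is discarded, while a read in which every base differs from its neighbour scores 1 and is retained.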
8. The apparatus of claim 6, wherein the filtering module comprises: a filtering unit that filters the candidate gene rearrangement gene pair according to at least one of the following conditions:
1) if the two genes of the candidate gene rearrangement gene pair are from the same chromosome, the two genes are separated by at least 100000 bp;
2) the number of the candidate gene rearrangement gene pairs occurring in the same gene is not more than 5;
3) the candidate gene rearrangement gene pair has read support of 2 pairs or more.
9. The apparatus according to any one of claims 6 to 8, further comprising an annotation module for annotating the target rearranged gene pair, preferably wherein the annotation comprises an HGVS annotation and/or a COSMIC annotation.
10. The apparatus according to any one of claims 6 to 8, further comprising a visualization module for visually displaying the rearrangement event of the target rearranged gene.
11. A storage medium comprising a stored program, wherein the program is executed to control a device on which the storage medium is located to perform the method for detecting gene rearrangement of any one of claims 1 to 5.
12. A processor configured to run a program, wherein the program when executed performs the method of detecting gene rearrangement of any one of claims 1 to 5.
CN201911144339.8A 2019-11-20 2019-11-20 Method and apparatus for detecting gene rearrangement Pending CN110942807A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911144339.8A CN110942807A (en) 2019-11-20 2019-11-20 Method and apparatus for detecting gene rearrangement

Publications (1)

Publication Number Publication Date
CN110942807A true CN110942807A (en) 2020-03-31

Family

ID=69907961

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911144339.8A Pending CN110942807A (en) 2019-11-20 2019-11-20 Method and apparatus for detecting gene rearrangement

Country Status (1)

Country Link
CN (1) CN110942807A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114496073A (en) * 2022-01-21 2022-05-13 至本医疗科技(上海)有限公司 Method, computing device, and computer storage medium for identifying positive rearrangements

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109712672A (en) * 2018-12-29 2019-05-03 北京优迅医学检验实验室有限公司 Detect method, apparatus, storage medium and the processor of gene rearrangement

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHIFU CHEN et al.: "GeneFuse: detection and visualization of target gene fusions from DNA sequencing data", International Journal of Biological Sciences *

Similar Documents

Publication Publication Date Title
CN109086571B (en) A kind of method and system that monogenic disease hereditary variation is intelligently interpreted and reported
Gong et al. Detection of somatic structural variants from short-read next-generation sequencing data
Ji et al. RNA‐seq: Basic bioinformatics analysis
Peng et al. IDBA-tran: a more robust de novo de Bruijn graph assembler for transcriptomes with uneven expression levels
Clark et al. ALE: a generic assembly likelihood evaluation framework for assessing the accuracy of genome and metagenome assemblies
Lee et al. Genomic dark matter: the reliability of short read mapping illustrated by the genome mappability score
Liu et al. Single-sample landscape entropy reveals the imminent phase transition during disease progression
RU2654575C2 (en) Method for detecting chromosomal structural abnormalities and device therefor
CN107423578B (en) Device for detecting somatic cell mutation
US20210257050A1 (en) Systems and methods for using neural networks for germline and somatic variant calling
CN107103207B (en) Accurate medical knowledge search system based on case multigroup variation characteristics and implementation method
Zhang et al. SVseq: an approach for detecting exact breakpoints of deletions with low-coverage sequence data
Iakovishina et al. SV-Bay: structural variant detection in cancer genomes using a Bayesian approach with correction for GC-content and read mappability
CN107766696A (en) Eucaryote alternative splicing analysis method and system based on RNA seq data
Guo et al. Guided construction of single cell reference for human and mouse lung
Gong et al. lncRNA-screen: an interactive platform for computationally screening long non-coding RNAs in large genomics datasets
CN110020726B (en) Method and system for ordering assembly sequence
CN107292129A (en) Susceptible genotype detection method
CN112375829B (en) Method and device for identifying UPD (uniparental disomy) by using family WES data and electronic equipment
CN116386718B (en) Method, apparatus and medium for detecting copy number variation
CN111292809B (en) Method, electronic device, and computer storage medium for detecting RNA level gene fusion
CN115083521A (en) Method and system for identifying tumor cell group in single cell transcriptome sequencing data
US20160132637A1 (en) Noise model to detect copy number alterations
CN110942807A (en) Method and apparatus for detecting gene rearrangement
CN111883210A (en) Single-gene disease name recommendation method and system based on clinical features and sequence variation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200331