CN111933214A

CN111933214A - Method and computing device for detecting RNA level somatic gene variation

Info

Publication number: CN111933214A
Application number: CN202011028770.9A
Authority: CN
Inventors: 王凯; 柳文进
Original assignee: Origimed Technology Shanghai Co ltd
Current assignee: Origimed Technology Shanghai Co ltd
Priority date: 2020-09-27
Filing date: 2020-09-27
Publication date: 2020-11-13
Anticipated expiration: 2040-09-27
Also published as: CN111933214B

Abstract

The present disclosure relates to a method, a computing device, and a computer storage medium for detecting RNA-level somatic genetic variation. The method comprises: generating RNA alignment result data for a tissue sample to be tested, for use in determining candidate variations; generating statistical information on the reads supporting each candidate variation in the RNA alignment result data and in the DNA alignment result data of a blood sample to be tested; determining a repeat sequence feature of the reference genome sequence at the position of the candidate variation in order to determine a predetermined filtering threshold; obtaining positive candidate variations based on the read statistics and the predetermined filtering threshold; and aligning the sequencing sequence comprising each positive candidate variation against a human reference genome and a predetermined RNA variation dataset to determine somatic variation result data. The present disclosure enables accurate and efficient detection of RNA-level somatic genetic variation.

Description

Method and computing device for detecting RNA level somatic gene variation

Technical Field

The present disclosure relates generally to bioinformatics processing, and in particular, to methods, computing devices, and computer storage media for detecting RNA-level somatic genetic variations.

Background

Tumors are a disease caused by genetic variation. Accurate detection of genetic variation can guide the administration of targeted drugs, and provide guidance for immunotherapy, chemotherapy assessment, endocrine therapy efficacy assessment, and tumor typing.

Conventional methods for detecting genetic variation mainly detect mutations from the alignment of DNA sequencing data of the sample to be tested. However, RNA editing occurs after DNA is transcribed and before RNA is translated into protein, and not all genetic variations are transcribed, so the genetic variation detected from DNA sequencing data differs to some extent from the genetic variation that is actually expressed. In addition, RNA is harder to preserve than DNA and undergoes abundant alternative splicing, which makes it difficult to detect somatic genetic variation accurately at the RNA level.

In summary, conventional schemes for detecting genetic variation have difficulty accurately detecting RNA-level somatic genetic variation.

Disclosure of Invention

The present disclosure provides a method, computing device, and computer storage medium for detecting RNA-level somatic genetic variations that can accurately and efficiently detect RNA-level somatic genetic variations.

According to a first aspect of the present disclosure, a method for detecting RNA-level somatic genetic variation is provided. The method comprises: generating RNA alignment result data for a tissue sample to be tested, based on aligning the RNA sequencing result of the tissue sample of a subject with a human reference genome, so that candidate variations can be determined from the RNA alignment result data; extracting the reads supporting the candidate variation from the RNA alignment result data and from the DNA alignment result data of a blood sample to be tested of the subject, so as to generate statistical information about the reads for filtering the candidate variation; determining repeat sequence features of the reference genome sequence at the positions of the candidate variations remaining after filtering, in order to determine a predetermined filtering threshold; filtering the candidate variations based on the read statistics and the predetermined filtering threshold to obtain positive candidate variations; and aligning the sequencing sequence comprising each positive candidate variation against a human reference genome and a predetermined RNA variation dataset to determine somatic variation result data.

According to a second aspect of the present disclosure, there is also provided a computing device comprising: a memory configured to store one or more computer programs; and a processor coupled to the memory and configured to execute the one or more computer programs to cause the computing device to perform the method of the first aspect of the disclosure.

According to a third aspect of the present disclosure, there is also provided a non-transitory computer-readable storage medium. The non-transitory computer readable storage medium has stored thereon machine executable instructions which, when executed, cause a machine to perform the method of the first aspect of the disclosure.

In some embodiments, filtering the candidate variations based on the read statistics and the predetermined filtering threshold comprises: filtering out the candidate variation in response to determining that at least one of the number of reads supporting the candidate variation, the alignment quality of the reads, the base quality of the candidate variant base, the positive/negative strand ratio, and the position of the candidate variation on the reads meets a predetermined condition.

In some embodiments, filtering out candidate variations comprises: in response to determining that the candidate variation is at an end position on reads, filtering out the candidate variation.

In some embodiments, filtering out candidate variations comprises: determining the difference between the positive/negative strand ratio of the reads carrying the candidate variation at its site and the positive/negative strand ratio of the reads that are not mutated at the same site; and in response to determining that the difference is greater than or equal to a predetermined difference threshold, filtering out the candidate variation.

In some embodiments, the difference is determined based on a chi-squared test.

In some embodiments, determining the repeating sequence features of the reference genomic sequence at the locations of the candidate variations left after filtering to determine the predetermined filtering threshold comprises: determining a first filtering threshold as a predetermined filtering threshold in response to determining that the repeat sequence characteristics of the reference genomic sequence at the location of the candidate variation left by the filtering satisfy a predetermined repeat sequence condition; and in response to determining that the repeat sequence features of the reference genomic sequence at the locations of the candidate variations left by the filtering do not satisfy the predetermined repeat sequence condition, determining a second filtering threshold to be the predetermined filtering threshold, the second filtering threshold being less than the first filtering threshold.

In some embodiments, determining somatic variation result data comprises: filtering out positive candidate variations in response to determining that the sequencing sequence comprising the positive candidate variation aligns to the human complete reference genome; or filtering out positive candidate variations in response to determining that the sequenced sequence comprising the positive candidate variation does not belong to the predetermined RNA variation dataset.

In some embodiments, the method for detecting RNA-level somatic genetic variation further comprises: acquiring RNA sequencing data of the tissue sample to be tested of the subject, DNA sequencing data of the tissue sample to be tested, and DNA sequencing data of the blood sample to be tested; generating DNA-level somatic variation data for the subject based on the DNA sequencing data of the tissue sample and the DNA sequencing data of the blood sample; and performing at least one of the following pre-processing operations on the acquired RNA sequencing data of the tissue sample to generate the RNA sequencing result of the tissue sample: removing sequencing adapters from the RNA sequencing data of the tissue sample; and removing, from the RNA sequencing data of the tissue sample, reads whose sequencing quality is lower than a predetermined quality threshold.

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the disclosure, nor is it intended to be used to limit the scope of the disclosure.

Drawings

Fig. 1 shows a schematic diagram of a system for implementing a method for detecting RNA level somatic genetic variations, according to an embodiment of the present disclosure.

Fig. 2 shows a flow diagram of a method for detecting RNA level somatic genetic variations, according to an embodiment of the disclosure.

Fig. 3 shows a schematic comparison of RNA-level somatic genetic variation with DNA-level somatic genetic variation, according to embodiments of the present disclosure.

Fig. 4 shows a schematic distribution of RNA-level and DNA-level somatic genetic variations according to embodiments of the present disclosure.

Fig. 5 shows a flow diagram of a method for filtering candidate variations in accordance with an embodiment of the present disclosure.

Fig. 6 illustrates a flow diagram of a method for determining somatic variation result data, in accordance with an embodiment of the present disclosure.

Fig. 7 shows a flow diagram of a method for detecting RNA level somatic genetic variations, according to an embodiment of the disclosure.

Fig. 8 schematically illustrates a block diagram of an electronic device suitable for implementing embodiments of the present disclosure.

Like or corresponding reference characters designate like or corresponding parts throughout the several views.

Detailed Description

Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

The term "include" and variations thereof as used herein is meant to be inclusive in an open-ended manner, i.e., "including but not limited to". Unless specifically stated otherwise, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same object.

As described above, conventional schemes detect genetic variation from DNA sequencing data. Because RNA editing occurs after DNA transcription and not all genetic variations are transcribed, the detected genetic variation differs to some extent from the genetic variation that is actually expressed. Moreover, RNA undergoes abundant alternative splicing, so it is difficult to detect RNA-level somatic genetic variation accurately.

To address, at least in part, one or more of the above problems, as well as other potential problems, example embodiments of the present disclosure propose a scheme for detecting RNA-level somatic genetic variation. The scheme comprises: generating RNA alignment result data for a tissue sample to be tested, based on aligning the RNA sequencing result of the tissue sample of a subject with a human reference genome, so that candidate variations can be determined from the RNA alignment result data; extracting the reads supporting the candidate variation from the RNA alignment result data and from the DNA alignment result data of a blood sample to be tested of the subject, so as to generate statistical information about the reads for filtering the candidate variation; determining repeat sequence features of the reference genome sequence at the positions of the candidate variations remaining after filtering, in order to determine a predetermined filtering threshold; filtering the candidate variations based on the read statistics and the predetermined filtering threshold to obtain positive candidate variations; and aligning the sequencing sequence comprising each positive candidate variation against a human reference genome and a predetermined RNA variation dataset to determine somatic variation result data.

In this scheme, the candidate variations are filtered using statistics on the reads that support them in the RNA alignment result data and in the corresponding DNA alignment result data, and the predetermined filtering threshold is determined according to whether a repeat sequence is present in the reference genome sequence at the position of each remaining candidate variation. The present disclosure can therefore adjust the predetermined filtering threshold for regions where PCR and sequencing errors are prone to occur. In addition, the candidate variations are filtered using the read statistics and the adjusted predetermined filtering threshold, together with the results of alignment against the human reference genome and the predetermined RNA variation dataset, to determine the somatic variation result data, so that the present disclosure can filter out variations caused by errors arising during alignment of the sequencing data. Consequently, the present disclosure enables more accurate and efficient detection of RNA-level somatic genetic variation.

Fig. 1 shows a schematic diagram of a system 100 for implementing a method for detecting RNA-level somatic genetic variation, according to an embodiment of the present disclosure. As shown in fig. 1, the system 100 includes: an alignment unit 110, a computing device 130, a bioinformatics server 120, and a network 150. The computing device 130 includes, for example, a data acquisition unit 132, a variant-supporting reads statistics unit 134, a repeat sequence feature determination unit 136, a predetermined filtering threshold determination unit 138, a positive candidate variation determination unit 140, and a somatic variation result data determination unit 142.

In some embodiments, the data acquisition unit 132, the variant-supporting reads statistics unit 134, the repeat sequence feature determination unit 136, the predetermined filtering threshold determination unit 138, the positive candidate variation determination unit 140, and the somatic variation result data determination unit 142 may be configured on one or more computing devices 130, and the alignment unit 110 may be independent of the computing device 130. The computing device 130 may interact with the alignment unit 110 and the bioinformatics server 120 in a wired or wireless manner (e.g., via the network 150).

The computing device 130 is configured to: determine candidate variations based on the RNA alignment result data; generate statistical information on the reads supporting the candidate variations in the RNA alignment result data and in the DNA alignment result data of the blood sample to be tested of the subject, for filtering the candidate variations; determine a predetermined filtering threshold based on a repeat sequence feature of the reference genome sequence at the position of the candidate variation; obtain positive candidate variations based on the read statistics and the predetermined filtering threshold; and align the sequencing sequence comprising each positive candidate variation against a human reference genome and a predetermined RNA variation dataset to determine somatic variation result data. In some embodiments, the computing device 130 may have one or more processing units, including special-purpose processing units such as GPUs, FPGAs, and ASICs, and general-purpose processing units such as CPUs. In addition, one or more virtual machines may be running on each computing device.

The data acquisition unit 132 is configured to generate RNA alignment result data for the tissue sample to be tested, based on aligning the RNA sequencing result of the tissue sample with the human reference genome, so that candidate variations can be determined from the RNA alignment result data. The data acquisition unit 132 may acquire FASTQ-format RNA sequencing data of the tissue sample to be tested, FASTQ-format DNA sequencing data of the tissue sample to be tested, and FASTQ-format DNA sequencing data of the blood sample to be tested from the bioinformatics server 120 or the alignment unit 110 via the network 150.

The variant-supporting reads statistics unit 134 is configured to extract the reads supporting the candidate variations from the RNA alignment result data and from the DNA alignment result data of the blood sample to be tested of the subject, so as to generate statistical information about the reads for filtering the candidate variations.

The repeat sequence feature determination unit 136 is configured to determine the repeat sequence features of the reference genome sequence at the positions of the candidate variations remaining after filtering.

The predetermined filtering threshold determination unit 138 is configured to determine a predetermined filtering threshold based on the determined repeat sequence features.

The positive candidate variation determination unit 140 is configured to filter the candidate variations based on the read statistics and the predetermined filtering threshold in order to obtain positive candidate variations.

The somatic variation result data determination unit 142 is configured to align the sequencing sequence comprising each positive candidate variation against a human reference genome and a predetermined RNA variation dataset to determine somatic variation result data.

A method for detecting RNA-level somatic genetic variation according to an embodiment of the present disclosure will be described below with reference to fig. 2. Fig. 2 shows a flow diagram of a method 200 for detecting RNA-level somatic genetic variation, according to an embodiment of the present disclosure. It should be understood that the method 200 may be performed, for example, at the electronic device 800 depicted in fig. 8, or at the computing device 130 depicted in fig. 1. It should also be understood that the method 200 may include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.

At step 202, the computing device 130 generates RNA alignment result data for the tissue sample to be tested, based on aligning the RNA sequencing result of the tissue sample with the human reference genome, so that candidate variations can be determined from the RNA alignment result data.

The RNA sequencing result of the tissue sample to be tested is, for example, FASTQ-format sequencing data obtained by whole-transcriptome sequencing (WTS) of a tumor tissue sample from a tumor patient to be tested. Specifically, the computing device 130 may acquire, via the data acquisition unit 132, RNA sequencing data (e.g., an RNA sequencing FASTQ file) of the tissue sample to be tested of the subject, and DNA sequencing data (e.g., DNA sequencing FASTQ files) of the blood sample to be tested and of the tissue sample to be tested. It then generates DNA-level somatic variation data for the subject based on the DNA sequencing data of the tissue sample and the DNA sequencing data of the blood sample, and pre-processes the acquired RNA sequencing data of the tissue sample to generate the RNA sequencing result of the tissue sample. The pre-processing includes, for example: removing sequencing adapters from the RNA sequencing data of the tissue sample; and/or removing, from the RNA sequencing data of the tissue sample, reads whose sequencing quality is lower than a predetermined quality threshold. The computing device 130 then aligns the pre-processed RNA sequencing result with a human reference genome sequence (e.g., the human Hg19 reference sequence) to generate RNA alignment result data (e.g., a BAM-format alignment file) for the tissue sample, so that candidate variations can be determined from the alignment result data (e.g., all variant sites obtained from the mutation site information in the BAM-format alignment file are taken as candidate variations).
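The two pre-processing operations described above can be illustrated with a minimal Python sketch. It is not the implementation of the disclosure: the adapter sequence, the use of mean Phred quality as the "sequencing quality" of a read, the Phred+33 encoding, and the threshold value are all assumptions made for illustration, and the names `FastqRecord`, `trim_adapter`, and `preprocess` are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    seq: str
    qual: str  # quality string, assumed to be Phred+33 encoded


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from well-formed 4-line FASTQ text."""
    cleaned = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(cleaned) - 3, 4):
        header, seq, _plus, qual = cleaned[i:i + 4]
        yield FastqRecord(header[1:], seq, qual)


def trim_adapter(rec: FastqRecord, adapter: str) -> FastqRecord:
    """Cut the read at the first exact occurrence of the adapter, if any."""
    idx = rec.seq.find(adapter)
    if idx == -1:
        return rec
    return FastqRecord(rec.name, rec.seq[:idx], rec.qual[:idx])


def mean_quality(rec: FastqRecord) -> float:
    """Mean Phred score of a read (0.0 for an empty read)."""
    if not rec.qual:
        return 0.0
    return sum(ord(c) - 33 for c in rec.qual) / len(rec.qual)


def preprocess(records: Iterable[FastqRecord], adapter: str,
               min_mean_quality: float) -> Iterator[FastqRecord]:
    """Trim adapters, then drop empty reads and reads below the quality threshold."""
    for rec in records:
        rec = trim_adapter(rec, adapter)
        if rec.seq and mean_quality(rec) >= min_mean_quality:
            yield rec
```

In practice a dedicated trimming tool would normally handle partial and mismatched adapter occurrences; the exact-match search here only shows where the step sits in the flow.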

At step 204, the computing device 130 extracts the reads supporting the candidate variation from the RNA alignment result data and from the DNA alignment result data of the blood sample to be tested of the subject, in order to generate statistical information about the reads for filtering the candidate variation.

The read statistics are generated, for example, as follows: the computing device 130 extracts the reads supporting the candidate variation from the BAM alignment file of the RNA alignment result data of the tissue sample to be tested and from the BAM alignment file of the DNA alignment result data of the blood sample to be tested, and then computes statistics over those supporting reads.

At step 206, the computing device 130 determines the repeat sequence features of the reference genome sequence at the positions of the candidate variations remaining after filtering, in order to determine a predetermined filtering threshold.

The predetermined filtering threshold is determined from the repeat sequence features, for example, as follows: if the computing device 130 determines that the repeat sequence feature of the reference genome sequence at the position of a remaining candidate variation satisfies a predetermined repeat sequence condition, a first filtering threshold is used as the predetermined filtering threshold; if the repeat sequence feature does not satisfy the predetermined repeat sequence condition, a second filtering threshold is used as the predetermined filtering threshold, the second filtering threshold being less than the first filtering threshold. The reason is that a region whose reference genome sequence contains a pronounced repeat sequence is relatively prone to sequencing errors, so when a pronounced repeat sequence region is detected at the current mutation site, the value of the predetermined filtering threshold needs to be raised. This makes the positive results that remain after filtering more accurate.
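As an illustration of what such read statistics might contain, the following Python sketch summarizes the reads overlapping one candidate site. The record layout `ReadObservation` and the chosen summary fields are hypothetical; the text does not specify them. In a real pipeline the observations would be extracted from the BAM alignment files, and the same summary could be computed once for the RNA data and once for the blood DNA data.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List, Optional, Union


@dataclass(frozen=True)
class ReadObservation:
    """One read overlapping the candidate site (hypothetical record layout)."""
    supports_variant: bool  # True if the read carries the candidate allele
    is_forward: bool        # True for positive-strand reads
    mapping_quality: int    # alignment quality of the read
    base_quality: int       # Phred quality of the base at the candidate site


Stat = Union[int, float, None]


def _median_or_none(values: List[int]) -> Optional[float]:
    return median(values) if values else None


def summarize_support(reads: List[ReadObservation]) -> Dict[str, Stat]:
    """Collect per-site counts and quality summaries for variant and reference reads."""
    alt = [r for r in reads if r.supports_variant]
    ref = [r for r in reads if not r.supports_variant]
    depth = len(reads)
    return {
        "depth": depth,
        "alt_reads": len(alt),
        "vaf": len(alt) / depth if depth else 0.0,
        "alt_forward": sum(r.is_forward for r in alt),
        "alt_reverse": sum(not r.is_forward for r in alt),
        "ref_forward": sum(r.is_forward for r in ref),
        "ref_reverse": sum(not r.is_forward for r in ref),
        "alt_median_mapq": _median_or_none([r.mapping_quality for r in alt]),
        "ref_median_mapq": _median_or_none([r.mapping_quality for r in ref]),
        "alt_median_baseq": _median_or_none([r.base_quality for r in alt]),
    }
```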
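The threshold selection above can be sketched in Python as follows. The text does not define the predetermined repeat sequence condition or the quantity the threshold applies to, so this sketch assumes a hypothetical condition (a unit of 1 to 3 bases repeated at least four times in the reference context around the site) and hypothetical threshold values; only the relationship between them (the threshold for repeat regions is the larger one) comes from the text.

```python
import re


def has_tandem_repeat(context: str, min_copies: int = 4) -> bool:
    """Return True if a 1-3 base unit occurs at least `min_copies` times in a row.

    This is an assumed stand-in for the predetermined repeat sequence condition.
    """
    if min_copies < 2:
        raise ValueError("min_copies must be at least 2")
    # One captured unit followed by at least (min_copies - 1) further copies.
    pattern = rf"([ACGT]{{1,3}})\1{{{min_copies - 1},}}"
    return re.search(pattern, context.upper()) is not None


def choose_filter_threshold(ref_context: str,
                            first_threshold: int = 8,
                            second_threshold: int = 4) -> int:
    """Pick the stricter (first) threshold in repeat regions, else the second.

    Both default values are hypothetical; the second must be below the first.
    """
    if second_threshold >= first_threshold:
        raise ValueError("second_threshold must be less than first_threshold")
    if has_tandem_repeat(ref_context):
        return first_threshold
    return second_threshold
```

For example, a reference context containing the run `TTTTTT` would receive the first threshold, while a context without such a run would receive the second.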

At step 208, the computing device 130 filters the candidate variants based on the statistics of reads and a predetermined filtering threshold to obtain positive candidate variants.

The candidate variations are filtered, for example, as follows: the computing device 130 determines whether at least one of the number of reads supporting the candidate variation, the alignment quality of the reads, the base quality of the candidate variant base, the positive/negative strand ratio, and the position of the candidate variation on the reads meets a predetermined condition, and filters out the candidate variation if so.

For example, the computing device 130 may determine whether the candidate variation lies at an end position of the reads; if so, the candidate variation is filtered out. As another example, if the computing device 130 determines that the alignment quality of the reads carrying the candidate variation is less than half the median alignment quality of the non-mutated reads at the same site, the candidate variation is filtered out. As a further example, if the computing device 130 determines that the base quality of the candidate variant base is less than or equal to a predetermined base quality threshold (for example and without limitation, 21), the candidate variation is filtered out.

As yet another example, the computing device 130 may determine the difference between the positive/negative strand ratio of the reads carrying the candidate variation at its site and the positive/negative strand ratio of the reads that are not mutated at the same site; if the computing device 130 determines that the difference is greater than or equal to a predetermined difference threshold, the candidate variation is filtered out. The predetermined difference threshold is determined, for example and without limitation, based on a chi-squared test. The method of determining the predetermined difference threshold and filtering the candidate variations based on the chi-squared test is described below in conjunction with fig. 5 and is not repeated here.
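The three example rules above can be sketched in Python as follows. Several details are assumptions, because the text does not fix them: the per-read qualities are aggregated with the median, a variation counts as being "at an end position" when every supporting read has it within a hypothetical `end_margin` bases of a read end, and the record layout `SiteRead` is invented for illustration. The base quality threshold of 21 is the non-limiting example value given in the text.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass(frozen=True)
class SiteRead:
    """One read overlapping the candidate site (hypothetical record layout)."""
    supports_variant: bool
    mapping_quality: int
    base_quality: int
    offset_in_read: int  # 0-based position of the site within the read
    read_length: int


def filter_reasons(reads: List[SiteRead],
                   base_quality_threshold: int = 21,
                   end_margin: int = 3) -> List[str]:
    """Return the reasons for filtering a candidate; an empty list means keep it."""
    alt = [r for r in reads if r.supports_variant]
    ref = [r for r in reads if not r.supports_variant]
    if not alt:
        return ["no_supporting_reads"]

    reasons: List[str] = []

    def near_end(r: SiteRead) -> bool:
        distance = min(r.offset_in_read, r.read_length - 1 - r.offset_in_read)
        return distance < end_margin

    # Rule 1 (assumed aggregation): the variant only ever appears near read ends.
    if all(near_end(r) for r in alt):
        reasons.append("read_end_position")

    # Rule 2: alignment quality below half the median of the non-mutated reads.
    if ref:
        alt_mapq = median(r.mapping_quality for r in alt)
        ref_mapq = median(r.mapping_quality for r in ref)
        if alt_mapq < 0.5 * ref_mapq:
            reasons.append("low_alignment_quality")

    # Rule 3: base quality at or below the predetermined threshold.
    if median(r.base_quality for r in alt) <= base_quality_threshold:
        reasons.append("low_base_quality")

    return reasons
```

Returning the list of reasons rather than a single boolean is a design choice of this sketch; it makes it easier to see which rule removed a candidate.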

At step 210, the computing device 130 aligns the sequencing sequence comprising the positive candidate variation against a human reference genome and a predetermined RNA variation dataset to determine somatic variation result data. For example, the computing device 130 re-aligns a representative sequencing sequence containing the positive candidate variation against the latest complete human reference genome and the predetermined RNA variation dataset, so as to filter out variations caused by errors arising during alignment of the sequencing data and thereby obtain reliable somatic variation results. The predetermined RNA variation dataset includes, for example, information on known somatic mutations, both actually detected and predicted. In this way, erroneous variations that arise during alignment and are not included in the predetermined RNA variation dataset can be filtered out.

In the above scheme, the candidate variations are filtered using statistics on the reads that support them in the RNA alignment result data and in the corresponding DNA alignment result data, and the predetermined filtering threshold is determined according to whether a repeat sequence is present in the reference genome sequence at the position of each remaining candidate variation. The candidate variations are then filtered using the read statistics and the adjusted predetermined filtering threshold, together with the results of alignment against the human reference genome and the predetermined RNA variation dataset, to determine the somatic variation result data, so that the present disclosure can filter out variations caused by errors arising during alignment of the sequencing data. Consequently, the present disclosure enables more accurate and efficient detection of RNA-level somatic genetic variation.
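A minimal Python sketch of this final filtering step is given below. It follows the two conditions stated in the embodiments: a positive candidate is dropped if its sequencing sequence aligns to the complete human reference genome, or if it does not belong to the predetermined RNA variation dataset. The re-alignment itself is represented by a caller-supplied function `aligns_to_reference`, which is a hypothetical placeholder for a real aligner, and the `Variant` record and the key used for dataset lookup are likewise assumptions.

```python
from typing import Callable, Iterable, List, NamedTuple, Set, Tuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    read_seq: str  # representative sequencing sequence carrying the variant


VariantKey = Tuple[str, int, str, str]


def select_somatic(candidates: Iterable[Variant],
                   known_variants: Set[VariantKey],
                   aligns_to_reference: Callable[[str], bool]) -> List[Variant]:
    """Keep positive candidates that survive both final checks."""
    kept: List[Variant] = []
    for v in candidates:
        # The sequence matches the complete reference, so the call is
        # treated as an alignment artifact and dropped.
        if aligns_to_reference(v.read_seq):
            continue
        # The variant is absent from the predetermined RNA variation dataset.
        if (v.chrom, v.pos, v.ref, v.alt) not in known_variants:
            continue
        kept.append(v)
    return kept
```

A toy usage, where substring search on a short string stands in for re-alignment to the complete reference:

```python
toy_reference = "ACGTACGTTTGACC"
result = select_somatic(
    [Variant("chr1", 200, "C", "T", "GGGTTTAAA")],
    {("chr1", 200, "C", "T")},
    lambda seq: seq in toy_reference,
)
```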

The consistency between the results of the method for detecting RNA-level somatic genetic variation and the method for detecting DNA-level somatic genetic variation according to embodiments of the present disclosure is described below with reference to Table 1 and fig. 3. As shown in Table 1, across 4 samples, 94 sites had somatic mutations detected at both the DNA level and the RNA level by the method of the present disclosure; 55 sites had somatic genetic variation at the RNA level but not at the DNA level; and 177 sites had somatic genetic variation at the DNA level but not at the RNA level. Table 1 thus shows a certain degree of agreement between DNA-level and RNA-level somatic genetic variation. At the same time, some DNA-level somatic genetic variations are not expressed in the RNA sequencing data, and some somatic genetic variations are unique to the RNA sequencing data.

Table 1

As shown in Table 1, the ratio of RNA-level somatic positive mutation sites to DNA-level somatic positive mutation sites detected by the method of the present disclosure is high, indicating that the present disclosure can accurately and efficiently detect RNA-level somatic genetic variation.

Fig. 3 shows a schematic comparison of RNA-level somatic genetic variation with DNA-level somatic genetic variation, according to embodiments of the present disclosure. Specifically, fig. 3 shows the Pearson correlation of the variant allele frequency (VAF) at the 94 sites where DNA-level and RNA-level somatic genetic variation coexist in the 4 samples. The ordinate in fig. 3 indicates the mutation frequency of the RNA-level somatic genetic variation at the co-occurring mutation sites (i.e., the 94 sites, e.g., mutation sites 310, 312, etc.); the abscissa in fig. 3 indicates the mutation frequency of the DNA-level somatic genetic variation at the same sites. As shown in fig. 3, the correlation coefficient Cor for the 94 sites where both DNA-level and RNA-level somatic variations are present was 0.486.

Fig. 4 shows a schematic distribution of RNA-level and DNA-level somatic genetic variations according to embodiments of the present disclosure. The abscissa of fig. 4 indicates whether a characteristic mutation at the DNA level is a positive result or a non-positive result in the RNA-level somatic genetic variation data, where a positive result means that the DNA-level characteristic mutation is included in the RNA-level somatic genetic variation data and a non-positive result means that it is not. The ordinate of fig. 4 indicates the coverage depth of the DNA-level characteristic mutation in the RNA data. In fig. 4, block 410 indicates the distribution range for DNA-level characteristic mutations that are included in the RNA-level somatic genetic variation data detected by the method of the present disclosure (i.e., positive results), and block 420 indicates the distribution range for DNA-level characteristic mutations that are not included (i.e., non-positive results). Marker 412 indicates the median coverage depth of the positive results; marker 422 indicates the median coverage depth of the non-positive results. As can be seen from fig. 4, the coverage depth of the non-positive results is low, which indicates that these DNA-level characteristic mutations are absent from the RNA-level somatic genetic variation data detected by the method of the present disclosure because the regions in which they lie are transcribed only weakly or not at all.
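For reference, the Pearson correlation coefficient between paired DNA-level and RNA-level mutation frequencies can be computed as in the following Python sketch. The function name and the input values shown afterwards are illustrative only and are not data from the disclosure.

```python
from math import sqrt
from typing import Sequence


def pearson_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / sqrt(sxx * syy)
```

With one DNA-level and one RNA-level frequency per shared site, a call such as `pearson_correlation(dna_vaf, rna_vaf)` would yield the kind of coefficient reported as Cor.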

Fig. 5 shows a flow diagram of a method 500 for filtering candidate variations in accordance with an embodiment of the present disclosure. It should be understood that the method 500 may be performed, for example, at the electronic device 800 depicted in fig. 8, or at the computing device 130 depicted in fig. 1. It should be understood that method 500 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.

At step 502, the computing device 130 determines a difference between the positive/negative strand ratio of reads at the site of the candidate variation and the positive/negative strand ratio of reads that are not mutated at the corresponding site.

At step 504, the computing device 130 determines whether the difference is greater than or equal to a predetermined difference threshold.

At step 506, if the computing device 130 determines that the difference is greater than or equal to the predetermined difference threshold, candidate variations are filtered out. In some embodiments, the difference is determined based on a chi-squared test, for example.

Specifically, the computing device 130 may calculate, based on the chi-squared test, the difference (i.e., the chi-squared value) between the positive/negative strand ratio of reads at the site where the candidate variation is located and the positive/negative strand ratio of reads that are not mutated at the corresponding site. For example, the computing device 130 may use a chi-square test on a 2 x 2 contingency table (also called a chi-square test of paired count data or of fourfold table data), whose four cells count mutated positive-strand reads, non-mutated positive-strand reads, mutated negative-strand reads, and non-mutated negative-strand reads, respectively. The statistical approach of the chi-square test on the 2 x 2 contingency table is described below in connection with table two.

Table two

The following describes a method of calculating the chi-squared value described above with reference to formula (1), and a method of calculating the degree of freedom with reference to formula (2).

Y = n(ad - bc)² / [(a + b)(c + d)(a + c)(b + d)]    (1)

V = (M - 1)(N - 1)    (2)

In the above equations (1) and (2), Y represents the chi-square value and V represents the degree of freedom. M represents the number of rows in the 2 x 2 contingency table, and N represents the number of columns. The degree of freedom V is, for example, 1. a, b, c, and d represent the frequencies of the statistical items corresponding to the four cells of the fourfold table. n represents the sum of the frequencies a, b, c, and d. The predetermined confidence level is, for example, 95%. The present disclosure filters candidate variations by determining, based on a chi-squared test at the predetermined confidence level, whether the positive/negative strand ratio of reads at the site where the candidate variation is located differs significantly from the positive/negative strand ratio of reads that are not mutated at the corresponding site.
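Formula (1) and the threshold comparison of steps 504 to 508 can be sketched as follows. This is a minimal illustration, not the disclosed implementation. The function names, the assignment of the four read counts to cells a, b, c, and d, and the handling of an empty row or column are assumptions. The default threshold 3.841 is the standard chi-square critical value for 1 degree of freedom at the 95% confidence level.

```python
# Chi-square critical value for V = (2 - 1)(2 - 1) = 1 degree of freedom
# at a 95% confidence level.
CRITICAL_95_DF1 = 3.841


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Chi-square value Y of a 2 x 2 table, following formula (1)."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        # Assumption: an empty row or column gives no evidence of a difference.
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def has_strand_bias(mut_pos: int, ref_pos: int, mut_neg: int, ref_neg: int,
                    threshold: float = CRITICAL_95_DF1) -> bool:
    """True if the candidate variation should be filtered out (step 506).

    Assumed cell layout: a = mutated positive-strand reads,
    b = non-mutated positive-strand reads, c = mutated negative-strand
    reads, d = non-mutated negative-strand reads.
    """
    return chi_square_2x2(mut_pos, ref_pos, mut_neg, ref_neg) >= threshold
```

With these assumptions, a candidate whose mutated reads lie mostly on one strand while the non-mutated reads lie mostly on the other yields a large Y and is filtered out, whereas a candidate with the same strand distribution in both groups yields Y = 0 and is retained.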

At step 508, if the computing device 130 determines that the difference is less than the predetermined difference threshold, the candidate variants are retained.

By adopting the above means, the present disclosure can improve the accuracy of positive results.

Fig. 6 shows a flow diagram of a method 600 for obtaining somatic variation result data, in accordance with an embodiment of the present disclosure. It should be understood that the method 600 may be performed, for example, at the electronic device 800 depicted in fig. 8, or at the computing device 130 depicted in fig. 1. It should be understood that method 600 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.

At step 602, the computing device 130 determines whether the sequenced sequence comprising the positive candidate variation aligns with a human complete reference genome.

If the computing device 130 determines that the sequenced sequence comprising the positive candidate variation aligns to the human complete reference genome, at step 606, the positive candidate variation is filtered out.

At step 604, the computing device 130 determines whether the sequenced sequence comprising the positive candidate variation does not belong to the predetermined RNA variation dataset.

If the computing device 130 determines that the sequenced sequence comprising the positive candidate variation does not belong to the predetermined RNA variation dataset, at step 606, the positive candidate variation is filtered out. The predetermined RNA variation dataset includes, for example, information on known somatic mutations that have actually been detected or predicted.

At step 608, the computing device 130 obtains somatic variation result data based on the remaining positive candidate variations.

In the scheme, by comparing the representative sequencing sequence containing the positive candidate variation with the human complete reference genome and the predetermined RNA variation data set again, some errors generated in the comparison process can be filtered out, and the reliability of RNA level somatic variation result data is improved.
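The decision flow of steps 602 to 608 can be sketched as follows. This is a minimal sketch under stated assumptions: the names `PositiveCandidate` and `select_somatic_variants` are hypothetical, the aligner is passed in as a caller-supplied predicate rather than a real alignment tool, and the predetermined RNA variation dataset is modeled as a set of variant identifiers.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set


@dataclass(frozen=True)
class PositiveCandidate:
    variant_id: str  # hypothetical identifier, e.g. "chr1:12345:A>T"
    sequence: str    # sequenced sequence comprising the variation


def select_somatic_variants(
    candidates: Iterable[PositiveCandidate],
    aligns_to_complete_reference: Callable[[str], bool],
    rna_variation_dataset: Set[str],
) -> List[PositiveCandidate]:
    """Keep only the candidates that pass both checks of method 600."""
    remaining = []
    for candidate in candidates:
        # Steps 602 and 606: filter out if the sequence aligns to the
        # human complete reference genome.
        if aligns_to_complete_reference(candidate.sequence):
            continue
        # Steps 604 and 606: filter out if the variation does not belong
        # to the predetermined RNA variation dataset.
        if candidate.variant_id not in rna_variation_dataset:
            continue
        remaining.append(candidate)
    # Step 608: the remaining candidates form the result data.
    return remaining
```

Passing the aligner as a predicate keeps the sketch independent of any particular alignment software; in practice the predicate would wrap whatever aligner is used for the re-alignment.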

Fig. 7 shows a flow diagram of a method 700 for detecting somatic genetic variations at RNA levels according to an embodiment of the disclosure. It should be understood that method 700 may be performed, for example, at the electronic device 800 depicted in fig. 8, or at the computing device 130 depicted in fig. 1. It should be understood that method 700 may also include additional acts not shown and/or may omit acts shown, as the scope of the present disclosure is not limited in this respect.

At step 702, sequencing is performed on a tissue sample to be tested and a matched blood sample to be tested of a test subject, so as to obtain an RNA sequencing result for the tissue sample to be tested, a DNA sequencing result for the tissue sample to be tested, and a DNA sequencing result for the blood sample to be tested, respectively. For example, tumor tissue samples and paired blood samples of 4 tumor patients can be obtained, and Whole Exome Sequencing (WES) of DNA and Whole Transcriptome Sequencing (WTS) can be performed on the tumor tissue samples and blood samples, respectively, to obtain FASTQ-format sequencing data for the RNA of the tumor tissue samples, FASTQ-format sequencing data for the DNA of the tumor tissue samples, and FASTQ-format sequencing data for the DNA of the blood samples. The purpose of obtaining the DNA sequencing result of the blood sample to be tested is to distinguish which mutations are somatic mutations and which are germline (inherited) mutations.

At step 704, a region file (i.e., a probe file) describing the genomic regions covered by sequencing is prepared, and a file of the coding sequence (CDS) regions covered by the probe file is obtained. It should be understood that, in the probe design process, the melting temperature (Tm) of the probe, the length of the probe, the GC content, the secondary structure of the probe, the complexity, the direction of the probe, the number of probes, and the like are comprehensively considered, so as to enhance the capture efficiency and uniformity of the probes.

At step 706, the computing device 130 obtains somatic variation data at the DNA level for the test subject based on the DNA sequencing results for the tissue sample to be tested and the DNA sequencing results for the blood sample to be tested. Through this processing, the computing device 130 identifies the sites at which DNA-level somatic variations exist, so that they can be used to check or verify the accuracy of the detected RNA-level somatic mutation data.

At step 708, the computing device 130 pre-processes the RNA sequencing results for the tissue sample to be tested. For example, the FASTQ-format sequencing data of the RNA of the tissue sample to be tested obtained at step 702 is pre-processed using AdapterRemoval software. The pre-processing includes, for example, removing sequencing adapters and filtering out low-quality reads.
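The two pre-processing operations named above can be illustrated with a simplified sketch. It is not a substitute for AdapterRemoval and does not reproduce its algorithm: the adapter is located by exact string match, quality is judged by the mean Phred score of the read, and the function names, the adapter sequence in the usage note, and the default quality threshold of 20 are all assumptions. The sketch relies only on the standard FASTQ convention that a quality character encodes the Phred score plus 33.

```python
from typing import Optional, Tuple


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def preprocess_read(sequence: str, quality: str, adapter: str,
                    min_mean_quality: float = 20.0) -> Optional[Tuple[str, str]]:
    """Trim the adapter and everything after it; drop low-quality reads.

    Returns the trimmed (sequence, quality) pair, or None if the read is
    discarded.
    """
    index = sequence.find(adapter)
    if index != -1:
        # Remove the adapter and any bases sequenced after it.
        sequence, quality = sequence[:index], quality[:index]
    if not sequence or mean_phred(quality) < min_mean_quality:
        return None
    return sequence, quality
```

For example, `preprocess_read("ACGTAGATCGG", "IIIIIIIIIII", "AGATCGG")` keeps the first four bases with their quality characters, while a read consisting only of adapter sequence or of very low-quality bases is discarded.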

At step 710, the computing device 130 aligns the RNA sequencing results for the tissue sample to be tested to the human reference genome to generate RNA comparison (alignment) result data for the tissue sample to be tested, from which candidate variations are determined. For example, the computing device 130 aligns the RNA sequencing results of the tissue sample to be tested to the hg19 reference genome to obtain an alignment result file in BAM format; the computing device 130 then analyzes the BAM-format alignment result file and uses all the mutation sites obtained as candidate variations.

At step 712, the computing device 130 extracts reads supporting the candidate variation from the RNA comparison result data and from the DNA comparison result data corresponding to the blood sample to be tested, in order to generate statistical information about the reads for filtering the candidate variation. For example, the computing device 130 extracts, within the range of the variation, the reads in the BAM-format RNA comparison result of the tumor tissue sample and the reads in the BAM-format DNA comparison result of the blood sample to generate statistical information about the reads. It then compares the statistical information about the reads carrying the variation with the statistical information about the reads in which the corresponding site of the reference genome is not mutated, filters out the candidate variation if the comparison indicates a significant difference between the two, and retains the candidate variation if the comparison indicates no significant difference. Thus, the present disclosure can eliminate mutational noise caused by sequencing errors and sequencing artifacts.

At step 714, if the computing device 130 determines that at least one of the number of reads supporting the candidate variation, the alignment quality of the reads, the base quality of the base of the candidate variation, the positive/negative strand ratio, and the position of the candidate variation on the reads meets a predetermined condition, the candidate variation is filtered out.

For example, if the computing device 130 determines that any of the number of reads supporting the candidate variation, the alignment quality of the reads, the base quality of the base of the candidate variation, the positive/negative strand ratio, and the position of the candidate variation on the reads fails to meet the corresponding predetermined threshold, the variant site is filtered out. For example, if the computing device 130 determines that the candidate variation is at an end position on the reads, the candidate variation is filtered out. Alternatively, if the computing device 130 determines that the alignment quality of the reads carrying the candidate variation is less than half the median alignment quality of the unmutated reads at the same site, the candidate variation is filtered out. As another example, if the computing device 130 determines that the base quality of the base of the candidate variation is less than or equal to 21, the candidate variation is filtered out. For another example, the computing device 130 may determine a difference between the positive/negative strand ratio of reads at the site where the candidate variation is located and the positive/negative strand ratio of reads that are not mutated at the corresponding site; if the computing device 130 determines that the difference is greater than or equal to the predetermined difference threshold, the candidate variation is filtered out. The difference is determined, for example and without limitation, based on a chi-squared test. Filtering the candidate variation by determining, based on the chi-squared test at a predetermined confidence level, whether the two positive/negative strand ratios differ significantly can improve the accuracy of positive results.
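Three of the example conditions above (end position on the reads, alignment quality below half the median of the unmutated reads, and base quality of 21 or less) can be sketched as follows. The disclosure does not state how per-read values are aggregated, so this sketch makes loud assumptions: the median over supporting reads is used for alignment quality and base quality, the end-position condition is applied only when every supporting read places the variation at a read end, and a candidate without supporting reads is filtered out. The names `SupportingRead` and `should_filter_candidate` are hypothetical.

```python
from dataclasses import dataclass
from statistics import median
from typing import Sequence


@dataclass
class SupportingRead:
    mapping_quality: int   # alignment quality of the read
    base_quality: int      # base quality at the variant base
    position_in_read: int  # 0-based offset of the variant base in the read
    read_length: int


def at_read_end(read: SupportingRead) -> bool:
    """True if the variant base is the first or last base of the read."""
    return read.position_in_read in (0, read.read_length - 1)


def should_filter_candidate(reads: Sequence[SupportingRead],
                            unmutated_mapqs: Sequence[int],
                            max_low_base_quality: int = 21) -> bool:
    """True if the candidate variation should be filtered out."""
    if not reads:
        return True  # assumption: no supporting reads, nothing to retain
    # Assumption: filter only if all supporting reads show the variant at an end.
    if all(at_read_end(read) for read in reads):
        return True
    # Alignment quality less than half the median of the unmutated reads.
    if unmutated_mapqs:
        variant_mapq = median(read.mapping_quality for read in reads)
        if variant_mapq < 0.5 * median(unmutated_mapqs):
            return True
    # Base quality of the variant base less than or equal to 21.
    if median(read.base_quality for read in reads) <= max_low_base_quality:
        return True
    return False
```

The strand-ratio condition is deliberately left out of this sketch because it requires the chi-squared calculation of formula (1).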

At step 716, the computing device 130 determines a repeat sequence characteristic of the reference genomic sequence at the location of the filtered candidate variation to determine a predetermined filtering threshold.

For example, the computing device 130 analyzes the sequence characteristics of the reference genome at the location of the mutation, and determines whether the sequence at the mutation is susceptible to Polymerase Chain Reaction (PCR) and sequencing errors, thereby performing further filtering.
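One way to picture the repeat-sequence analysis of step 716 is a check for a homopolymer run at the variant position, followed by the choice of a filtering threshold. This is a sketch under assumptions: the disclosure does not define the repeat condition or the threshold values here, so the run-length cutoff of 5 and the threshold values 8 and 4 are placeholders, and the function names are hypothetical. The only relation taken from the disclosure is that the threshold used when the repeat condition is not met is lower than the one used when it is met.

```python
def homopolymer_run_length(reference: str, position: int) -> int:
    """Length of the run of identical bases that contains `position` (0-based)."""
    base = reference[position]
    left = position
    while left > 0 and reference[left - 1] == base:
        left -= 1
    right = position
    while right < len(reference) - 1 and reference[right + 1] == base:
        right += 1
    return right - left + 1


def pick_filter_threshold(reference: str, position: int,
                          min_repeat_run: int = 5,
                          repeat_threshold: int = 8,
                          non_repeat_threshold: int = 4) -> int:
    """Choose the filtering threshold from the local repeat characteristic.

    All three default values are placeholders, not values from the disclosure.
    """
    if homopolymer_run_length(reference, position) >= min_repeat_run:
        # Repeat condition met: apply the stricter (first) threshold.
        return repeat_threshold
    # Repeat condition not met: apply the lower (second) threshold.
    return non_repeat_threshold
```

A stricter threshold in repetitive sequence reflects the concern stated above that such regions are more prone to PCR and sequencing errors.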

At step 718, the computing device 130 aligns the sequenced sequence comprising the positive candidate variation with a human reference genome and a predetermined RNA variation dataset to obtain somatic variation result data.

By adopting the above means, the present disclosure can detect RNA-level somatic genetic variation more accurately and efficiently.

Fig. 8 schematically illustrates a block diagram of an electronic device 800 suitable for implementing embodiments of the present disclosure. The device 800 may be a device for implementing the methods 200 and 500 to 700 shown in figs. 2 and 5 to 7. As shown in fig. 8, device 800 includes a Central Processing Unit (CPU) 801 that may perform various appropriate actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM) 802 or loaded from a storage unit 808 into a Random Access Memory (RAM) 803. The RAM 803 can also store various programs and data required for the operation of the device 800. The CPU 801, ROM 802, and RAM 803 are connected to each other via a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.

A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, an output unit 807, a storage unit 808, and a communication unit 809. The processing unit 801 performs the respective methods and processes described above, for example, methods 200 and 500 to 700. For example, in some embodiments, methods 200 and 500 to 700 may be implemented as a computer software program stored on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto device 800 via ROM 802 and/or communication unit 809. When loaded into RAM 803 and executed by CPU 801, the computer program may perform one or more of the operations of methods 200 and 500 to 700 described above. Alternatively, in other embodiments, CPU 801 may be configured to perform one or more acts of methods 200 and 500 to 700 by any other suitable means (e.g., by way of firmware).

It should be further appreciated that the present disclosure may be embodied as methods, apparatus, systems, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for carrying out various aspects of the present disclosure.

The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.

The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.

The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), is personalized by utilizing state information of the computer-readable program instructions, and the electronic circuitry can execute the computer-readable program instructions to implement aspects of the present disclosure.

Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor in a voice interaction device, a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

The above are merely alternative embodiments of the present disclosure and are not intended to limit the present disclosure, which may be modified and varied by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims

1. A method for detecting somatic genetic variation at RNA levels comprising:

generating RNA comparison result data for a tissue sample to be tested based on an RNA sequencing result of the tissue sample to be tested for a subject to be tested and a human reference genome for determining candidate variations based on the RNA comparison result data;

extracting reads supporting the candidate variation from the RNA comparison result data and from DNA comparison result data of a blood sample to be tested of the subject to be tested, so as to generate statistical information about the reads for filtering the candidate variation;

determining repeat sequence features of the reference genomic sequence at locations of the candidate variations left after filtering to determine a predetermined filtering threshold;

filtering the candidate variants based on the statistics of the reads and the predetermined filtering threshold to obtain positive candidate variants; and

comparing the sequenced sequence comprising the positive candidate variation to a human reference genome and a predetermined RNA variation dataset to determine somatic variation result data.

2. The method of claim 1, wherein filtering the candidate variants based on statistics of the reads and the predetermined filtering threshold comprises:

filtering out the candidate variation in response to determining that at least one of a number of reads supporting the candidate variation, an alignment quality of the reads, a base quality of a base of the candidate variation, a positive/negative strand ratio, and a position of the candidate variation on the reads meets a predetermined condition.

3. The method of claim 2, wherein filtering out the candidate variations comprises:

filtering out the candidate variation in response to determining that the candidate variation is at an end position on reads.

4. The method of claim 2, wherein filtering out the candidate variations comprises:

determining a difference between the positive/negative strand ratio of reads at the site where the candidate variation is located and the positive/negative strand ratio of reads that are not mutated at the corresponding site;

filtering the candidate variations in response to determining that the difference is greater than or equal to a predetermined difference threshold.

5. The method of claim 4, wherein the difference is determined based on a Chi-squared test.

6. The method of claim 1, wherein determining repeating sequence features of a reference genomic sequence at locations of candidate variations left after filtering to determine a predetermined filtering threshold comprises:

determining a first filtering threshold as a predetermined filtering threshold in response to determining that the repeat sequence characteristics of the reference genomic sequence at the location of the candidate variation left by the filtering satisfy a predetermined repeat sequence condition; and

determining a second filtering threshold to be a predetermined filtering threshold in response to determining that the repeating sequence features of the reference genomic sequence at the locations of the candidate variations left by filtering do not satisfy the predetermined repeating sequence condition, the second filtering threshold being less than the first filtering threshold.

7. The method of claim 1, wherein determining somatic variation outcome data comprises:

filtering out the positive candidate variation in response to determining that the sequenced sequence comprising the positive candidate variation aligns with the human complete reference genome; or

filtering out the positive candidate variation in response to determining that the sequenced sequence comprising the positive candidate variation does not belong to the predetermined RNA variation dataset.

8. The method of claim 1, further comprising:

obtaining RNA sequencing data of a tissue sample to be tested of the object to be tested, sequencing data of DNA of the tissue sample to be tested of the object to be tested and sequencing data of DNA of a blood sample to be tested;

generating somatic variation data regarding the DNA level of the test subject based on the sequencing data regarding the DNA of the test tissue sample and the sequencing data regarding the DNA of the test blood sample;

pre-processing the obtained RNA sequencing data about the tissue sample to be tested to generate RNA sequencing results about the tissue sample to be tested by at least one of:

removing sequencing adapters from the RNA sequencing data of the tissue sample to be tested; and

removing, from the RNA sequencing data of the tissue sample to be tested, reads whose sequencing quality is lower than a predetermined quality threshold.

9. A computing device, comprising:

at least one processing unit;

at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions when executed by the at least one processing unit, cause the apparatus to perform the steps of the method of any of claims 1 to 8.

10. A computer-readable storage medium, having stored thereon a computer program which, when executed by a machine, implements the method of any of claims 1-8.