CN114927191A

CN114927191A - Interpretation method for NGS report of blood system disease

Info

Publication number: CN114927191A
Application number: CN202210385521.8A
Authority: CN
Inventors: 付海阔; 王奇隆; 舒金才; 陈金雄; 尚华
Original assignee: Beijing Gaolingzhiteng Information Technology Co ltd
Current assignee: Beijing Gaolingzhiteng Information Technology Co ltd
Priority date: 2022-04-13
Filing date: 2022-04-13
Publication date: 2022-08-19
Anticipated expiration: 2042-04-13
Also published as: CN114927191B

Abstract

The invention relates to a method for interpreting NGS reports for hematological diseases, belonging to the field of bioinformatics interpretation, and comprising the following steps: S1: uploading a VCF file containing the result data of the bioinformatics analysis; S2: determining whether the uploaded file is an annotation file, and if so, executing step S4, otherwise executing step S3; S3: obtaining the filtering conditions of the test item corresponding to the sample, performing filtering, and generating an annotation file; S4: reading the variant site information in the annotation file and storing it in structured form; S5: matching the variant site information against a knowledge base; S6: acquiring report data; S7: acquiring a report template according to the submitting institution; S8: generating and exporting a report. The invention provides convenient data management and rapid report output, helps users manage genetic data intelligently, and improves both the capability and the efficiency of report interpretation.

Description

Interpretation method for NGS report of blood system disease

Technical Field

The invention belongs to the field of bioinformatics interpretation and relates to a method for interpreting NGS reports for hematological diseases.

Background

High-throughput sequencing, also called next-generation sequencing (NGS), can sequence hundreds of thousands to millions of DNA molecules in parallel in a single run; the reads are generally short, and the complete sequence information is assembled from many short DNA fragments. The NGS bioinformatics workflow can be divided into three levels: primary analysis, which converts the raw off-instrument data (BCL format) into readable data (VCF format); secondary analysis, which annotates and filters the sites in the VCF data; and tertiary analysis, which interprets the clinical significance of the variant sites in light of the patient's clinical diagnosis and treatment. Report interpretation is the last and most important link. Interpreting a next-generation sequencing report requires searching a large number of databases and professional literature, and faces problems such as large data volume, complex operation, and difficult queries; a fully manual interpretation of a single tumor report takes about 6 hours.

Disclosure of Invention

In view of the above, the present invention provides a method for interpreting a hematological disease NGS report.

To achieve this purpose, the invention provides the following technical solution:

a method for interpreting an NGS report for a hematological disease, comprising the steps of:

S1: uploading a VCF file containing the result data of the bioinformatics analysis;

S2: determining whether the uploaded file is an annotation file; if so, executing step S4, otherwise executing step S3;

S3: obtaining the filtering conditions of the test item corresponding to the sample, performing filtering, and generating an annotation file;

S4: reading the variant site information in the annotation file and storing it in structured form;

S5: matching the variant site information against a knowledge base;

S6: acquiring report data;

S7: acquiring a report template according to the submitting institution;

S8: generating and exporting a report.
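The branch at step S2 can be sketched in Python as follows. This is only an illustration: the patent does not say how an annotation file is recognized, so the sketch assumes it can be detected from INFO header lines that common annotators write (`ANN` for SnpEff, `CSQ` for Ensembl VEP). The function names `is_annotated` and `plan_steps` are hypothetical.

```python
from typing import Iterable, List

# Assumption: an "annotation file" declares one of these INFO fields in its header.
ANNOTATION_INFO_IDS = ("ANN", "CSQ")


def is_annotated(vcf_lines: Iterable[str]) -> bool:
    """S2: return True if the VCF header declares an annotation INFO field."""
    for line in vcf_lines:
        if not line.startswith("##"):
            break  # meta-information lines end at the #CHROM column header
        for info_id in ANNOTATION_INFO_IDS:
            if line.startswith(f"##INFO=<ID={info_id},"):
                return True
    return False


def plan_steps(vcf_lines: List[str]) -> List[str]:
    """Return the ordered step labels that would run for this upload."""
    steps = ["S1", "S2"]
    if not is_annotated(vcf_lines):
        steps.append("S3")  # unannotated upload: filter and annotate first
    steps += ["S4", "S5", "S6", "S7", "S8"]
    return steps
```

An upload that already carries annotations skips S3 and goes straight to structured storage in S4.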

Further, the knowledge base comprises the following structured information:

gene: records information on human genes, including basic information, description information, and other related information, wherein the basic information includes the gene name, gene position, gene type, and common transcripts, and the other related information includes the gene's protein domains, related diseases, and evidence;

variant: divided into parent variants and sub-variants;

parent variant: general variant-related information is integrated and summarized and assigned a variant grade; sub-variants of the same type are associated with the parent variant; the variant description of the parent variant is summarized; variant summary information is compiled for each disease related to the parent variant; and evidence summary information is compiled for each evidence category, namely targeted medication, treatment, prognosis, diagnosis, risk, clinical characteristics, population distribution, and drug metabolism;

sub-variant: comprises several types, wherein the details of each sub-variant include the basic information, variant description, and grading information of the variant and are associated with the related parent variant; in addition, variant summary information is compiled for each disease related to the sub-variant, and evidence summary information is compiled for each evidence category, namely targeted medication, treatment, prognosis, diagnosis, risk, clinical characteristics, population distribution, and drug metabolism;

evidence: includes approved medication, guidelines, clinical trials, and literature evidence, wherein, by interpreting these different types of documents, the genes, variants, and diseases they address are summarized, together with the evidence type, evidence relationship, and evidence grading;

disease: clinical hematological diseases are recorded and organized into levels.
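One way to picture the knowledge-base structure above is as a set of linked record types. The sketch below uses Python dataclasses; all field names are illustrative assumptions and are not taken from the patent, which does not disclose a storage schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Gene:
    name: str
    position: str
    gene_type: str
    common_transcript: str
    description: str = ""
    protein_domains: List[str] = field(default_factory=list)


@dataclass
class Evidence:
    evidence_type: str  # approved medication, guideline, clinical trial, literature
    label: str          # e.g. treatment, prognosis, diagnosis
    disease: str        # name of a node in the disease tree
    grade: str
    source: str = ""    # where the evidence can be traced back to


@dataclass
class ParentVariant:
    gene: str
    name: str
    grade: str
    description: str = ""
    evidence: List[Evidence] = field(default_factory=list)


@dataclass
class SubVariant:
    gene: str
    transcript: str
    variant_type: str                 # SNP/InDel, Fusion, CNV, or SV
    p_notation: Optional[str] = None  # protein-level change
    c_notation: Optional[str] = None  # coding-DNA-level change
    grade: str = ""
    parent: Optional[ParentVariant] = None
    evidence: List[Evidence] = field(default_factory=list)


@dataclass
class Disease:
    name: str
    parent: Optional[str] = None  # parent node in the disease tree, None for a root
```

Linking each `SubVariant` to a `ParentVariant` is what later allows an unrecorded site to inherit a preliminary grade from its parent.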

Further, the sub-variant types include:

SNP/InDel: single-nucleotide variants and insertions or deletions of small fragments;

Fusion: fusion gene, in which two genes are joined together;

CNV: copy number variation, duplication of large fragments;

SV: other structural variations.

Further, matching the variant site information against the knowledge base in step S5 specifically comprises:

for each piece of site information, determining whether it is a frameshift hotspot variant and, if so, correcting it with TransVar; then matching the site against the knowledge base by gene, transcript, and p. notation (the protein-level change); if several knowledge-base variant sites match, preferring the one whose c. notation (the coding-DNA-level change) is empty; if no site matches, matching by gene, transcript, and c. notation; and if a variant site is matched, returning its interpretation information;

if the unmatched variant is of a frameshift or hotspot type, matching a parent variant through the variant site; if the matched parent variant is clinically significant, the site, although not recorded in the knowledge base, is treated as a clinically significant variant site, is automatically graded and given a preliminary interpretation according to the related information, and is at the same time entered into the knowledge base and flagged to await expert interpretation;

and if no variant site information is matched, entering a manual interpretation stage as appropriate, in which a report interpreter finally confirms the sites to be reported and the report is generated.
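The two-stage lookup described above (protein-level notation first, coding-level notation as a fallback) can be sketched as follows. This is a minimal illustration under stated assumptions: knowledge-base entries are plain dictionaries with hypothetical keys `gene`, `transcript`, `p`, and `c`, and when several protein-level hits exist but none has an empty `c`, the sketch simply returns the first hit, a case the text does not specify. The TransVar correction step is not shown.

```python
from typing import Dict, List, Optional

KbEntry = Dict[str, Optional[str]]


def match_site(site: KbEntry, knowledge_base: List[KbEntry]) -> Optional[KbEntry]:
    """Match by gene + transcript + p. first, then by gene + transcript + c."""

    def same(entry: KbEntry, key: str) -> bool:
        return entry.get(key) is not None and entry.get(key) == site.get(key)

    scoped = [e for e in knowledge_base
              if same(e, "gene") and same(e, "transcript")]

    by_p = [e for e in scoped if same(e, "p")]
    if by_p:
        # Several hits: prefer the entry whose c. notation is empty.
        for entry in by_p:
            if not entry.get("c"):
                return entry
        return by_p[0]  # assumption: otherwise take the first hit

    by_c = [e for e in scoped if same(e, "c")]
    return by_c[0] if by_c else None
```

A return value of `None` corresponds to the unmatched case, which is handed on to parent-variant matching or to manual interpretation.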

Further, the details of interpretation are as follows:

according to the grading of the variant sites, the variants of clinical significance and of potential clinical significance are identified; according to the position of the patient's clinically diagnosed disease in the disease tree, the multiple pieces of evidence of different dimensions under the same variant, such as targeted medication, treatment, prognosis, diagnosis, risk, clinical characteristics, population distribution, and drug metabolism, are weighted and scored, and the optimal interpretation data are identified;

evidence weight scoring: every piece of evidence for a variant carries three dimensions, namely an evidence type, an evidence label, and a related disease; during interpretation the three dimensions are scored according to the sample information and the report purpose, and the evidence with the lowest score is the best interpretation evidence;

evidence type: includes approved medication, guidelines, clinical trials, and literature, and is scored according to the report purpose;

evidence label: includes labels such as treatment, prognosis, diagnosis, risk, clinical characteristics, and population distribution, and is likewise scored according to the report purpose;

related disease: the score is determined by the hierarchical relationship, within the disease tree, between the evidence's disease and the patient's diagnosed disease given in the sample information, according to the following rule: the node itself and its child nodes have the highest priority and score 1; the node's sibling nodes and the child nodes below them score 1.5; the parent node scores 2; the parent node's sibling nodes and the child nodes below them score 2.5; and all evidence is scored level by level using this logic;

and after the scores for the three dimensions are obtained, they are multiplied as required to obtain the best-matching evidence, or the evidence is grouped and sorted by the three dimensions, and the best interpretation data are selected to generate the report.
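The final selection step can be sketched as below, showing both options the text mentions: multiplying the three dimension scores and taking the lowest product, or sorting by the dimensions in turn. The class and function names are hypothetical, and the sort order used in `rank_grouped` (disease first, then type, then label) is an assumption, since the text does not fix one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredEvidence:
    name: str
    type_score: float     # evidence type dimension
    label_score: float    # evidence label dimension
    disease_score: float  # related-disease dimension

    @property
    def total(self) -> float:
        """Product of the three dimension scores; lower is better."""
        return self.type_score * self.label_score * self.disease_score


def best_evidence(items: List[ScoredEvidence]) -> ScoredEvidence:
    """Option 1: multiply the three scores and pick the lowest product."""
    return min(items, key=lambda e: e.total)


def rank_grouped(items: List[ScoredEvidence]) -> List[ScoredEvidence]:
    """Option 2: sort by the dimensions in turn (order is an assumption)."""
    return sorted(items, key=lambda e: (e.disease_score, e.type_score, e.label_score))
```

The two options need not agree: a piece of evidence that is distant in the disease tree can still win on the product if its type and label scores are both low, whereas disease-first sorting would rank it last.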

The beneficial effects of the invention are as follows: by storing the bioinformatics analysis results in structured form, the invention gives users more accurate, multidimensional interpreted site data; detailed interpretation information can be queried quickly by gene locus, together with the related functional impact, diseases, drugs, and other information; each piece of information can be traced back to the original text of its source; and the interpretation content can be updated promptly in line with the latest research. Interpreted content gradually accumulates into a knowledge base; after uploading next-generation sequencing data, the user can have it matched against the knowledge base automatically, with already-interpreted gene loci marked to help the user screen sites. Each piece of interpretation information can be retrieved along multiple dimensions, such as information source, gene variant type, disease, and drug. By selecting the gene loci to be written into the report, the user can generate a report with one click, and the whole process from uploading sequencing data to outputting the report can be completed in a few minutes.

Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof.

Drawings

For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:

FIG. 1 is a general flowchart of a method for interpreting a hematological disease NGS report;

FIG. 2 is a diagram of a knowledge base architecture;

FIG. 3 is an automatic + manual matching flow chart;

FIG. 4 is an automatic + manual matching implementation;

FIG. 5 is a chart of specific disease categories given in the examples.

Detailed Description

The following embodiments of the present invention are provided by way of specific examples, and other advantages and effects of the present invention will be readily apparent to those skilled in the art from the disclosure herein. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and embodiments may be combined with each other without conflict.

The drawings are for illustrative purposes only; they are schematic representations rather than depictions of an actual product and are not to be construed as limiting the invention. To better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged, or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings, and descriptions thereof, may be omitted.

The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by the terms "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not intended to indicate or imply that the device or element referred to must have a specific orientation, be constructed and operated in a specific orientation, and therefore the terms describing the positional relationship in the drawings are only used for illustrative purposes and are not to be construed as limiting the present invention, and the specific meaning of the terms described above will be understood by those skilled in the art according to the specific circumstances.

Referring to FIG. 1, the method for interpreting an NGS report for a hematological disease provided by the present invention includes the following steps:

Step 1: uploading a VCF file;

Step 2: determining whether the uploaded file is an annotation file, and if so, executing Step 4;

Step 3: obtaining the filtering conditions of the test item corresponding to the sample, performing filtering, and generating an annotation file;

Step 4: reading the variant site information in the file and storing it in structured form;

Step 5: matching the variant site information against the knowledge base;

Step 6: acquiring report data;

Step 7: acquiring a report template according to the submitting institution;

Step 8: generating and exporting a report.
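The fourth step, reading variant site information and storing it in structured form, might look like the following sketch. It relies only on the eight fixed tab-separated VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO); the record type `VariantRecord` and the choice to emit one record per alternate allele are assumptions made for illustration, not details given in the patent.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class VariantRecord:
    chrom: str
    pos: int
    ref: str
    alt: str
    info: Dict[str, str]


def parse_info(info_field: str) -> Dict[str, str]:
    """Split 'K1=V1;K2=V2;FLAG' into a dict; flags map to an empty string."""
    result: Dict[str, str] = {}
    if info_field == ".":
        return result  # "." means the INFO column is empty
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        result[key] = value
    return result


def read_variants(vcf_lines: Iterable[str]) -> List[VariantRecord]:
    """Turn VCF data lines into structured records, skipping header lines."""
    records: List[VariantRecord] = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alts, _qual, _filter, info = line.split("\t")[:8]
        for alt in alts.split(","):  # one record per alternate allele
            records.append(VariantRecord(chrom, int(pos), ref, alt, parse_info(info)))
    return records
```

Once the sites are held as records like these, the later matching step can work on named fields instead of raw text columns.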

The structure of the knowledge base is shown in fig. 2 and is divided into the following parts: gene, variation, evidence, literature, disease.

Gene: this part mainly records information on human genes, including basic information such as the gene name, gene position, gene type, and common transcripts; gene descriptions compiled by interpreters; descriptions of the gene on related websites such as GeneCards, OMIM, and UniProt; and related information such as the gene's protein domains, related diseases, and evidence.

Variant: divided into two major categories, parent variants and sub-variants.

Parent variant: some common variant-related information is integrated and summarized and assigned a variant grade; sub-variants of the same type are associated with the parent variant; the variant description of the parent variant is summarized; variant summary information is compiled for each disease related to the parent variant; and evidence summary information is compiled for each evidence category, namely targeted medication, treatment, prognosis, diagnosis, risk, clinical characteristics, population distribution, and drug metabolism.

Sub-variant: the main types are SNP/InDel (single-nucleotide variants and insertions or deletions of small fragments), Fusion (fusion gene, in which two genes are joined together), CNV (copy number variation, duplication of large fragments), and SV (other structural variations).

The details of each sub-variant mainly comprise the basic information, variant description, and grading information of the variant, and are associated with the related parent variant; in addition, variant summary information is compiled for each disease related to the sub-variant, and evidence summary information is compiled for each category of related evidence, such as targeted medication, treatment, prognosis, diagnosis, risk, clinical characteristics, population distribution, and drug metabolism.

Evidence: includes evidence types such as approved medication, guidelines, clinical trials, and literature; by interpreting these different types of documents, the genes, variants, and diseases they address are summarized, together with the evidence type, evidence relationship, and evidence grading.

The specific steps of knowledge-base matching and interpretation are shown in FIGS. 3 and 4. The patient's VCF file is first filtered according to the test item, formatted, and imported into the system. Each piece of site information is then checked for a frameshift hotspot variant and, if one is present, corrected with TransVar. Site matching against the knowledge base follows: the site is matched by gene, transcript, and p. notation (the protein-level change); if several knowledge-base variant sites match, the one whose c. notation (the coding-DNA-level change) is empty is preferred; if no site matches, the site is matched by gene, transcript, and c. notation; and if a variant site is matched, its interpretation information is returned.

Because variant sites related to hematological diseases are highly heterogeneous, even a very complete knowledge base cannot avoid unmatched sites, so a process for automatically discovering variant sites was designed. When the unmatched variant is of a frameshift or hotspot type, a parent variant is matched through the variant site. If the matched parent variant is clinically significant, the site, although not recorded in the knowledge base, is treated as a clinically significant variant site: it is automatically graded and given a preliminary interpretation according to the related information, and is at the same time entered into the knowledge base and flagged to await deeper interpretation by an expert.
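The automatic discovery process can be sketched as follows. The text does not say how an unmatched site is paired with a parent variant, so this sketch assumes, purely for illustration, that a parent is found by gene plus variant category; the dictionary keys (`category`, `clinically_significant`, `grade`, `summary`, `pending_expert_review`) are likewise hypothetical.

```python
from typing import Any, Dict, List, Optional

Record = Dict[str, Any]


def discover_site(site: Record,
                  parent_variants: List[Record],
                  knowledge_base: List[Record]) -> Optional[Record]:
    """Give an unmatched frameshift/hotspot site a preliminary interpretation."""
    if site.get("category") not in ("frameshift", "hotspot"):
        return None  # other unmatched sites go to manual interpretation

    # Assumption: a parent variant is identified by gene and variant category.
    parent = next(
        (p for p in parent_variants
         if p["gene"] == site["gene"] and p["category"] == site["category"]),
        None,
    )
    if parent is None or not parent.get("clinically_significant"):
        return None

    entry = dict(
        site,
        grade=parent["grade"],             # inherit the parent's grade
        interpretation=parent["summary"],  # preliminary interpretation only
        parent=parent["name"],
        pending_expert_review=True,        # flagged for expert follow-up
    )
    knowledge_base.append(entry)           # record the new site
    return entry
```

Recording the site with a review flag means the next sample carrying the same variant is matched directly, while the expert queue shows which entries still hold only a preliminary interpretation.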

Interpretation: first, according to the grading of the variant sites, the variants of clinical significance and of potential clinical significance are identified. Second, according to the position of the patient's clinically diagnosed disease in the disease tree, the multiple pieces of evidence of different dimensions under the same variant, such as targeted medication, treatment, prognosis, diagnosis, risk, clinical characteristics, population distribution, and drug metabolism, are weighted and scored, and the optimal interpretation data are identified.

Evidence weight scoring: every piece of evidence for a variant carries an evidence type (approved medication, guideline, clinical trial, or literature), an evidence label (treatment, prognosis, diagnosis, risk, clinical characteristics, population distribution, and so on), and an associated standard disease. During interpretation, these three dimensions are scored according to the sample information and the report purpose, and the evidence with the lowest score is the best interpretation evidence.

(1) Evidence type: scored according to the report purpose; if the report purpose is treatment, approved-medication and clinical-trial evidence scores 0.5, and all other evidence scores the default of 1.

(2) Evidence label: scored in a similar way to the evidence type; if the report purpose is diagnosis, evidence carrying related labels such as prognosis, diagnosis, and risk scores between 0.5 and 1, and all other evidence scores 1.

(3) Related disease: the score is determined by the hierarchical relationship, within the disease tree, between the evidence's disease and the patient's diagnosed disease given in the sample information. The rule is that the node itself and its child nodes have the highest priority and score 1; the node's sibling nodes and the child nodes below them score 1.5; the parent node scores 2; the parent node's sibling nodes and the child nodes below them score 2.5; and all evidence is scored level by level using this logic.
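The related-disease rule can be written as a short tree walk. The sketch below stores the disease tree as a child-to-parent mapping and uses a made-up tree with abstract node names, not the tree of FIG. 5. Continuing the pattern above the parent level (grandparent scores 3, its siblings and their descendants 3.5, and so on) is an extrapolation of "level by level" and should be read as an assumption.

```python
from typing import Dict, List, Optional


def ancestors(node: str, parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Return [node, parent, grandparent, ...] up to the root."""
    chain = [node]
    while parent_of.get(chain[-1]) is not None:
        chain.append(parent_of[chain[-1]])
    return chain


def disease_score(diagnosed: str, evidence_disease: str,
                  parent_of: Dict[str, Optional[str]]) -> Optional[float]:
    """Score an evidence disease relative to the diagnosed disease."""
    evidence_chain = ancestors(evidence_disease, parent_of)
    # k counts how many levels above the diagnosed node the two paths meet.
    for k, ancestor in enumerate(ancestors(diagnosed, parent_of)):
        if ancestor in evidence_chain:
            if k == 0:
                return 1.0      # the node itself or one of its descendants
            if ancestor == evidence_disease:
                return k + 1.0  # parent scores 2; grandparent 3 (extrapolated)
            return k + 0.5      # sibling branch: 1.5, then 2.5, ...
    return None                 # no common ancestor in this tree


# Hypothetical tree: root -> A, B; A -> A1, A2; A1 -> A1a; B -> B1
TREE = {"root": None, "A": "root", "B": "root",
        "A1": "A", "A2": "A", "A1a": "A1", "B1": "B"}
```

With `A1` as the diagnosed disease, evidence attached to `A1a` scores 1, to `A2` 1.5, to `A` 2, and to `B` or `B1` 2.5.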

As shown in FIG. 5, in this example, when the patient was diagnosed with mature B-cell lymphoma, the following scores were made as shown in Table 1 below:

TABLE 1

Finally, once the scores for the three dimensions are obtained, they can be multiplied as required to obtain the best-matching evidence, or the evidence can be grouped and sorted by the three dimensions; the best interpretation data are then selected to generate the report, so that the clinician can quickly understand the details of the patient's disease.

Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims

1. A method for interpreting an NGS report for a hematological disease, characterized by comprising the following steps:

S1: uploading a VCF file containing the result data of the bioinformatics analysis;

S3: obtaining the filtering conditions of the test item corresponding to the sample, performing filtering, and generating an annotation file;

S5: matching the variant site information against a knowledge base;

S6: acquiring report data;

S7: acquiring a report template according to the submitting institution;

S8: generating and exporting a report.

2. The method for interpreting an NGS report for a hematological disease according to claim 1, wherein the knowledge base comprises the following structured information:

gene: records information on human genes, including basic information, description information, and other related information, wherein the basic information includes the gene name, gene position, gene type, and common transcripts, and the other related information includes the gene's protein domains, related diseases, and evidence;

variant: divided into parent variants and sub-variants;

parent variant: general variant-related information is integrated and summarized and assigned a variant grade; sub-variants of the same type are associated with the parent variant; the variant description of the parent variant is summarized; variant summary information is compiled for each disease related to the parent variant; and evidence summary information is compiled for each evidence category, namely targeted medication, treatment, prognosis, diagnosis, risk, clinical characteristics, population distribution, and drug metabolism;

evidence: includes approved medication, guidelines, clinical trials, and literature evidence, wherein, by interpreting these different types of documents, the genes, variants, and diseases they address are summarized, together with the evidence type, evidence relationship, and evidence grading;

3. The method for interpreting an NGS report for a hematological disease according to claim 1, wherein the sub-variant types include:

Fusion: fusion gene, in which two genes are joined together;

CNV: copy number variation, duplication of large fragments;

SV: other structural variations.

4. The method for interpreting a hematological disease NGS report according to claim 1, wherein: the matching of the mutation site information in the knowledge base in step S5 specifically includes:

5. The method for interpreting a hematological disease NGS report according to claim 1, wherein: the details of interpretation are as follows:

evidence label: includes labels such as treatment, prognosis, diagnosis, risk, clinical characteristics, and population distribution, and is likewise scored according to the report purpose;

related disease: the score is determined by the hierarchical relationship, within the disease tree, between the evidence's disease and the patient's diagnosed disease given in the sample information, according to the following rule: the node itself and its child nodes have the highest priority and score 1; the node's sibling nodes and the child nodes below them score 1.5; the parent node scores 2; the parent node's sibling nodes and the child nodes below them score 2.5; and all evidence is scored level by level using this logic;

and after the scores for the three dimensions are obtained, they are multiplied as required to obtain the best-matching evidence, or the evidence is grouped and sorted by the three dimensions, and the best interpretation data are selected to generate the report.