CN115148285B

CN115148285B - Information screening method, device, electronic equipment, medium and program product

Info

Publication number: CN115148285B
Application number: CN202210647422.2A
Authority: CN
Inventors: 余欢
Original assignee: Qitan Technology Ltd Beijing
Current assignee: Qitan Technology Ltd Beijing
Priority date: 2022-06-09
Filing date: 2022-06-09
Publication date: 2023-08-22
Anticipated expiration: 2042-06-09
Also published as: CN115148285A

Abstract

The application discloses an information screening method, an information screening device, electronic equipment, a medium and a program product. The method comprises: obtaining case samples with a preset disease and control samples corresponding to the case samples; and performing the following steps for each of K genes in the case samples and the control samples: counting a first number of mutated alleles in the case samples and a second number of mutated alleles in the control samples; determining a third number of unmutated alleles in the case samples based on the total number of case samples and the first number; determining a fourth number of unmutated alleles in the control samples based on the total number of control samples and the second number; performing significance difference verification on the first number, the second number, the third number and the fourth number to obtain a verification result; and screening, based on the verification result, rare mutant genes causing the onset of the preset disease from the K genes. Rare mutant genes can thereby be screened efficiently.

Description

Information screening method, device, electronic equipment, medium and program product

Technical Field

The present application relates to the field of information processing technologies, and in particular, to an information screening method, an apparatus, an electronic device, a medium, and a program product.

Background

A key step in the research and treatment of genetic diseases and tumors is the identification of new genes or mutation markers associated with the disease. On that basis, targeted drug development, disease treatment, disease management and other health interventions can be carried out. Existing methods for mining new disease-associated mutations or markers have limitations. Methods that filter and screen case samples alone tend to have complex and diverse screening conditions, and their power to pinpoint pathogenic genes is relatively weak because negative control samples are lacking. Genome-wide association studies (GWAS) improve detection power by adding control samples, but they usually require genome information from a large number of control samples, and when candidate sites are screened, insertion-deletion (InDel) mutations and rare mutation sites are often omitted. These rare mutation sites, including rare InDel mutations, often have a significant association with disease. How to screen the rare mutation sites and corresponding pathogenic genes that cause a disease economically and efficiently has not yet been well solved.

Disclosure of Invention

The embodiment of the application aims to provide an information screening method, an information screening device, electronic equipment, a medium and a program product, so as to realize the effect of efficiently screening rare mutation sites and corresponding genes which cause diseases.

The technical scheme of the application is as follows:

In a first aspect, an information screening method is provided, including:

acquiring N case samples with preset diseases and M control samples corresponding to the case samples, wherein N and M are positive integers;

the following steps are performed for each of K genes in the case sample and the control sample, respectively, K being a positive integer:

counting the sum of the mutation numbers of the allele in the N case samples to obtain a first number and the sum of the mutation numbers of the allele in the M control samples to obtain a second number;

determining a sum of the number of unmutated alleles in the case samples based on the total number of case samples and the first number, resulting in a third number;

determining a sum of the number of unmutated alleles in the control samples based on the total number of control samples and the second number, resulting in a fourth number;

performing significance difference verification on the first number, the second number, the third number and the fourth number to obtain a verification result;

and screening rare mutant genes causing the onset of the preset disease from the K genes based on the verification result.

In a second aspect, there is provided an information screening apparatus, the apparatus comprising:

the first acquisition module is used for acquiring N case samples with preset diseases and M control samples corresponding to the case samples, wherein N and M are positive integers;

a statistics module, configured to, for each of the K genes in the case sample and the control sample, count a sum of numbers of mutations in the allele in the N case samples to obtain a first number, and count a sum of numbers of mutations in the allele in the M control samples to obtain a second number;

a first determining module for determining, for each of the K genes in the case sample and the control sample, a sum of the number of unmutated alleles in the case sample based on a total sample number of the case sample and the first number, resulting in a third number;

a second determining module for determining, for each of the K genes in the case sample and the control sample, a sum of the number of unmutated alleles in the control sample based on a total sample number of the control sample and the second number, resulting in a fourth number;

the verification module is used for carrying out significance difference verification on the first quantity, the second quantity, the third quantity and the fourth quantity for each gene in the K genes in the case sample and the control sample to obtain a verification result;

and the screening module is used for screening rare mutant genes causing the pathogenesis of the preset diseases from the K genes based on the verification result.

In a third aspect, an embodiment of the present application provides an electronic device, where the electronic device includes a processor, a memory, and a program or an instruction stored in the memory and capable of running on the processor, where the program or the instruction implements the steps of the information screening method according to any one of the embodiments of the present application when executed by the processor.

In a fourth aspect, an embodiment of the present application provides a readable storage medium, where a program or an instruction is stored, where the program or the instruction implements the steps of the information screening method according to any one of the embodiments of the present application when executed by a processor.

In a fifth aspect, embodiments of the present application provide a computer program product, where instructions in the computer program product, when executed by a processor of an electronic device, enable the electronic device to perform the steps of the information screening method according to any one of the embodiments of the present application.

The technical scheme provided by the embodiment of the application at least has the following beneficial effects:

according to the information screening method provided by the embodiment of the application, N case samples with preset diseases and M control samples corresponding to the case samples are obtained; for each of the K genes in the case sample and the control sample: counting the sum of the mutation numbers of the allele in the case samples to obtain a first number and the sum of the mutation numbers of the allele in the control samples to obtain a second number; and then determining the sum of the number of the non-mutated alleles in the case sample to obtain a third number, and the sum of the number of the non-mutated alleles in the control sample to obtain a fourth number, and performing significant difference verification on the first number, the second number, the third number and the fourth number to obtain a verification result, wherein according to the verification result, rare mutant genes causing the preset disease onset can be determined, so that rare mutation sites causing the preset disease onset and corresponding rare mutant genes can be accurately screened.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and, together with the description, serve to explain the principles of the application; they do not constitute an undue limitation on the application.

Fig. 1 is a schematic flow chart of an information screening method according to an embodiment of the first aspect of the present application;

Fig. 2 is a schematic structural diagram of an information screening apparatus according to an embodiment of the second aspect of the present application;

Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of a third aspect of the present application.

Detailed Description

In order to enable a person skilled in the art to better understand the technical solutions of the present application, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. It should be understood that the particular embodiments described herein are meant to be illustrative of the application only and not limiting. It will be apparent to one skilled in the art that the present application may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the application by showing examples of the application.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of implementations consistent with aspects of the application as set forth in the following claims.

Before the technical scheme of the application is introduced, its technical background is described:

A key step in the research and treatment of genetic diseases and tumors is to determine the genes or mutation markers associated with the disease; on that basis, targeted drug development, disease treatment, disease management and other health interventions can be carried out.

With the popularization of sequencing technology, a large amount of human genome data and mutation information are generated, the progress of disease research and the discovery of genes or mutation markers are promoted, part of difficult and complicated diseases are solved, and part of genetic diseases and tumor patients are benefited.

The traditional screening method of candidate sites of diseases or tumors mainly comprises the following steps:

1) Family-based screening. The main idea is to determine or infer the inheritance pattern of the disease from the disease manifestations of the family members, and then to select, according to that inheritance pattern, the mutation sites that are unique to the patients and co-segregate with the disease.

2) Screening based on tumor somatic samples and germline control samples, which selects somatic mutations specific to tumor tissue.

Specifically, a sample of somatic cells of tumor tissue of a patient is compared with a germ line sample of non-tumor tissue (such as blood, non-tumor tissue cells, etc.), and somatic mutations specific to tumor tissue are selected.

3) Case-control genome-wide association analysis of single nucleotide polymorphism (SNP) sites, which screens for mutations significantly enriched in the case samples, based on the principle that cases and controls differ in the mutation sites they carry.

4) Screening methods based on mutation site characteristics and disease correlation, such as screening by utilizing characteristics of pathogenic sites that have low relative mutation frequency in the population and are more easily enriched in pathways related to disease occurrence, etc.

Among the above screening methods, strategies for effectively mining pathogenic genes from sporadic case samples are fewer and more complicated than those for family samples and for tumor samples with normal controls. GWAS based on cases from the affected population and normal controls is one of the most common and effective research methods for sporadic cases. Traditional GWAS analysis mainly uses a large sample size to look for differences between the case and control groups in the number of people carrying common single nucleotide polymorphisms (SNPs, population mutation frequency AF > 1%) in order to explain complex diseases. Because the positions and intervals at which InDels occur are not fixed, InDels are difficult to use in GWAS analysis and are often ignored by traditional GWAS. Traditional GWAS is not only costly, but also has the following disadvantages: 1) The effects of InDels are easily ignored. An InDel deletes or inserts bases in a gene and is therefore more likely to produce a frameshift mutation that truncates the gene and affects its function, so InDels should not be ignored in research. 2) The effects and information of rare SNPs and InDels (population mutation frequency AF < 1%) are easily ignored. This is also related to earlier technical limitations: most studies used chip technology, and rare mutations are difficult to mine when the sample size is not large enough. Rare mutations, including rare SNPs and InDels, are nevertheless considered more likely to be related to disease. 3) Usually only the difference in the number of people carrying a mutation between the case and control samples is counted, so the difference in pathogenicity between homozygous and heterozygous mutations is easily ignored.

Therefore, the frequency difference of all rare mutations (including rare SNP and rare InDel) at different positions of a gene or a specific regulatory region in case and control samples is considered and analyzed simultaneously, so that the combined action of different types of mutations at different positions can be studied, and the genetic relationship and occurrence mechanism of a part of diseases can be explained. It has also been proved that, for both rare and common diseases, the research of rare mutations plays a vital role in the development of disease causative genes, and therefore, rare mutations (including rare SNPs and rare indels) should not be omitted during analysis, but should be emphasized as a class of mutation types. Focusing on rare mutations alone has the further advantage of requiring a smaller sample size to achieve the same statistical performance, which is more beneficial in reducing research investment costs.

Thanks to the development of high-throughput sequencing technology, the whole genome or whole exome of a large number of samples can be sequenced directly, covering all mutations of all genes of the samples. Researchers can therefore better mine rare mutation site information, which promotes association analysis based on rare mutation sites.

Many efforts have been made to investigate rare mutations, opening a new path for disease research and demonstrating the effectiveness of this approach; however, the efficacy of mining rare mutations alone is not high.

For sporadic samples, pathogenic genes are often mined by consensus screening or by traditional GWAS methods. Consensus screening often performs poorly because there are too many irrelevant background mutations. Traditional GWAS requires a sufficient number of samples to carry the same mutation site for statistical analysis, so InDel mutations and rare mutations that occur in only a few individuals are ignored. Yet rare mutations (including rare SNPs and rare InDels), particularly those predicted to be deleterious by prediction tools, tend to contribute more strongly to disease development.

In order to solve the above problems, the present application provides an information screening method, apparatus, electronic device, medium, and program product by acquiring N case samples having a preset disease, and M control samples corresponding to the case samples; for each of the K genes in the case sample and the control sample: counting the sum of the mutation numbers of the allele in the case samples to obtain a first number and the sum of the mutation numbers of the allele in the control samples to obtain a second number; and then determining the sum of the number of the non-mutated alleles in the case sample to obtain a third number, and the sum of the number of the non-mutated alleles in the control sample to obtain a fourth number, and performing significant difference verification on the first number, the second number, the third number and the fourth number to obtain a verification result, wherein according to the verification result, rare mutant genes causing the preset disease onset can be determined, so that rare mutation sites causing the preset disease onset and corresponding rare mutant genes can be accurately screened.

The information screening method provided by the embodiment of the application is described in detail below through specific embodiments and application scenes thereof with reference to the accompanying drawings.

Fig. 1 is a schematic flow chart of an information screening method according to an embodiment of the present application. The execution body of the information screening method may be a server; the present application does not limit the execution body.

As shown in fig. 1, the information screening method provided in the embodiment of the present application may include steps 110 to 160.

Step 110, N case samples with preset diseases and M control samples corresponding to the case samples are obtained.

Wherein N and M are positive integers.

Step 120, counting the sum of the numbers of mutations of the allele in the N case samples for each gene in the K genes in the case samples and the control samples to obtain a first number, and counting the sum of the numbers of mutations of the allele in the M control samples to obtain a second number.

Wherein K is a positive integer.

Step 130, for each of the K genes in the case sample and the control sample, determining a sum of the number of unmutated alleles in the case sample based on the total number of case samples and the first number, resulting in a third number.

Step 140, for each of the K genes in the case sample and the control sample, determining a sum of the number of unmutated alleles in the control sample based on the total number of control samples and the second number, resulting in a fourth number.

Step 150, performing significance difference verification on the first number, the second number, the third number and the fourth number for each of the K genes in the case sample and the control sample to obtain a verification result.

Step 160, screening rare mutant genes causing the onset of the preset disease from K genes based on the verification result.

In the embodiment of the application, N case samples with preset diseases and M control samples corresponding to the case samples are obtained; for each of the K genes in the case sample and the control sample: counting the sum of the mutation numbers of the allele in the case samples to obtain a first number and the sum of the mutation numbers of the allele in the control samples to obtain a second number; and then determining the sum of the number of the non-mutated alleles in the case sample to obtain a third number, and the sum of the number of the non-mutated alleles in the control sample to obtain a fourth number, and performing significant difference verification on the first number, the second number, the third number and the fourth number to obtain a verification result, wherein according to the verification result, rare mutant genes causing the preset disease onset can be determined, so that rare mutation sites causing the preset disease onset and corresponding rare mutant genes can be accurately screened.
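Steps 130 to 160 can be sketched as follows. This is a minimal illustration, not the implementation of the application: it assumes two alleles per sample for an autosomal gene, so that the third number is 2N minus the first number and the fourth number is 2M minus the second number (consistent with the worked example given later in this description). The names `GeneCounts`, `build_counts` and `screen_genes`, the pluggable `p_value` callable and the 0.05 threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class GeneCounts:
    """The four per-gene numbers used in steps 120 to 150."""
    first: int   # mutated alleles summed over the N case samples
    second: int  # mutated alleles summed over the M control samples
    third: int   # unmutated alleles in the case samples
    fourth: int  # unmutated alleles in the control samples


def build_counts(first: int, second: int, n_cases: int, m_controls: int) -> GeneCounts:
    """Steps 130 and 140: derive the unmutated allele counts.

    Assumption: each sample contributes two alleles for an autosomal gene.
    """
    case_alleles = 2 * n_cases
    control_alleles = 2 * m_controls
    if not 0 <= first <= case_alleles:
        raise ValueError("first number exceeds the number of case alleles")
    if not 0 <= second <= control_alleles:
        raise ValueError("second number exceeds the number of control alleles")
    return GeneCounts(first, second, case_alleles - first, control_alleles - second)


def screen_genes(
    counts_by_gene: Dict[str, GeneCounts],
    p_value: Callable[[GeneCounts], float],
    alpha: float = 0.05,  # hypothetical significance threshold
) -> List[str]:
    """Steps 150 and 160: keep the genes whose test result is significant."""
    return [gene for gene, counts in counts_by_gene.items() if p_value(counts) < alpha]


# 100 case samples with 60 mutated alleles, 100 control samples with 20.
print(build_counts(first=60, second=20, n_cases=100, m_controls=100))
```

The significance test is passed in as a callable so that any test on the four numbers (for example a Fisher test) can be used without changing the screening loop.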

The information screening method provided by the embodiment of the application is described in detail below.

The preset disease may be a disease specified in advance, for example red-green color blindness.

The case sample may be a sample having the preset disease.

The control sample may be a sample without the preset disease.

Here, N and M may both be positive integers.

In some embodiments of the present application, the case samples and the control samples may be obtained from a public database. Alternatively, self-tested samples may be used: tested samples having the preset disease serve as case samples, and tested samples without the preset disease serve as control samples.

In some embodiments of the present application, the public database may include, but is not limited to, at least one of: the 1000 Genomes Project database, the Genome Aggregation Database (gnomAD), and the National Center for Biotechnology Information (NCBI).

In some embodiments of the application, these public databases contain a large amount of population mutation information and have been widely used in scientific research. They can be used as controls to screen for relatively low-frequency mutations in the disease or tumor under study. It is generally preferable to select control samples that lack the relevant phenotype of the disease under study. However, the sample size of these public databases is so large that it compensates for their not being perfectly phenotype-matched controls, and they can serve as natural population controls. For example, the individuals in the 1000 Genomes data are healthy individuals, and the gnomAD database holds exome data of 125748 unrelated individuals and whole genome data of 15708 unrelated individuals, all of which can be used as controls for rare mutation association analysis.

For example, if the disease under study is color blindness, samples without color blindness can be obtained from the public database as control samples. Even if a control sample has another disease, such as heart disease, it can serve as a control for the color blindness case samples as long as it does not have color blindness. Because control samples can be obtained from the public database, the method is not limited to self-tested control samples, which saves detection cost.

At present, using normal controls to mine the genes or regions enriched for rare mutations in case samples, and thereby locate pathogenic factors more accurately, requires a large number of disease samples and normal control samples as a statistical basis, which sets a certain technical threshold. In practice, researchers tend to collect only the case samples under study; a large number of control samples is either unavailable or, even when collected, extremely expensive to sequence and analyze. Take 1000 case samples and 1000 control samples as an example, and assume that whole genome sequencing and analysis costs 5000 yuan per sample. Sequencing and analyzing the cases alone costs 5 million yuan, and sequencing and analyzing the controls costs a further 5 million yuan, which is a large additional cost.

Thus, downloading the control samples from a public database saves detection cost.

Step 120, counting the sum of the mutation numbers of the allele in the N case samples for each gene in the K genes in the case samples and the control samples to obtain a first number, and counting the sum of the mutation numbers of the allele in each gene in the control samples to obtain a second number.

Wherein the first number may be the sum of the numbers of allele mutations of the gene over the case samples.

The second number may be the sum of the numbers of allele mutations of the gene over the control samples.

In some embodiments of the present application, the first number and the second number may be counted by a statistical algorithm or manually, which is not limited herein; any manner in which the first number and the second number can be counted falls within the scope of the present application.

It should be noted that each gene on an autosome corresponds to two alleles. When a heterozygous mutation exists in the gene, the number of allele mutations of the gene is 1; when a homozygous mutation exists, the number is 2. Because the focus is on rare mutations, in most cases a gene carries only 1 or 0 rare mutations, and only in very few cases 2 or more. Since a gene has at most two mutated alleles, the number of rare mutations of a gene is recorded as 2 when more than 2 rare mutations accumulate on it. To highlight the difference in pathogenicity between homozygous mutations (both alleles of an autosomal gene are mutated) and heterozygous mutations, the number of allele mutations of each gene is counted rather than simply the number of people carrying a mutation, so that differences between case and control samples caused by pathogenic homozygous mutations are not missed.

For example, consider a disease caused by homozygous mutation, in which 30 of 100 case samples have a homozygous rare mutation on a gene and 20 of 100 control samples have a heterozygous mutation on the same gene. If the number of people carrying a mutation in the gene is counted, the case samples have 30 mutated and 70 non-mutated individuals, and the control samples have 20 mutated and 80 non-mutated individuals. The Fisher test gives a P value of 0.1065 for the difference in the number of mutated individuals between cases and controls, so the difference is not significant and the gene cannot be considered related to the disease.

If the number of allele mutations of the gene is counted instead, the case samples have 30×2=60 mutant alleles and 100×2-30×2=140 non-mutant alleles, and the control samples have 20×1=20 mutant alleles and 100×2-20=180 non-mutant alleles. The Fisher test gives a P value of 0.00000071 for the difference in the number of allele mutations between cases and controls, which is significant, so the gene may be considered related to the disease. The example shows that counting allele mutations, unlike counting mutated individuals, does not miss the difference between case and control samples caused by homozygous mutations.
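The comparison above can be reproduced in outline with the following sketch. It is an illustration under stated assumptions rather than the procedure of the application: the two-sided Fisher exact test is implemented directly from the hypergeometric distribution, and the helper names `gene_allele_mutations` and `fisher_exact_two_sided` are hypothetical. The description does not state whether its P values are one-sided or two-sided, so the printed values may differ from the figures quoted above; the qualitative conclusion (not significant by carrier count, significant by allele count) is what the sketch is meant to show.

```python
from math import comb


def gene_allele_mutations(heterozygous: int, homozygous: int) -> int:
    """Allele mutations of one gene in one sample, capped at 2.

    A heterozygous rare mutation counts as 1 and a homozygous one as 2;
    more than 2 accumulated mutations are recorded as 2.
    """
    return min(heterozygous + 2 * homozygous, 2)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x mutated entries in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Sum every table that is no more probable than the observed one.
    tail = sum(p for p in map(prob, range(lowest, highest + 1))
               if p <= p_observed * (1 + 1e-7))
    return min(1.0, tail)


# Counting people: 30 of 100 cases and 20 of 100 controls carry a mutation.
p_carriers = fisher_exact_two_sided(30, 70, 20, 80)

# Counting alleles: cases are homozygous (2 each), controls heterozygous (1 each).
case_mutant = 30 * gene_allele_mutations(heterozygous=0, homozygous=1)     # 60
control_mutant = 20 * gene_allele_mutations(heterozygous=1, homozygous=0)  # 20
p_alleles = fisher_exact_two_sided(case_mutant, 2 * 100 - case_mutant,
                                   control_mutant, 2 * 100 - control_mutant)

print(f"carrier-level P = {p_carriers:.4f}, allele-level P = {p_alleles:.2e}")
```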

In some embodiments of the present application, in order to further improve the efficiency of determining rare mutant genes causing the onset of the preset disease, the information screening method may further include, after step 110:

obtaining a gene detection result of a case sample and annotation information corresponding to a gene of the case sample;

and filtering the sites of each gene in the case samples and the sites of each gene in the control samples under the site identity condition, based on the annotation information, to obtain target mutation sites.

Wherein the annotation information may be a related annotation corresponding to a gene of the case sample. For example, it may be an annotation of the probability of mutation of a certain site of the certain gene in a population, whether the mutation of the gene is benign, or the like.

The site identity condition means that the same screening conditions are applied to the sites of the case samples and the sites of the control samples. For example, the sites of each gene in the case samples and the sites of each gene in the control samples can both be screened to retain the sites whose mutation frequency in the population is less than 0.1%.

The target mutation site may be a mutation site left after the site of each gene in the case sample and the site of each gene in the control sample are subjected to the same condition filtration, respectively.

The K genes may be genes including a target mutation site.
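As a rough illustration of the filtering just described, the sketch below applies one and the same frequency filter to the case sites and the control sites and then derives the K genes as the genes that contain a remaining target mutation site. The `Site` record, the function name `filter_sites` and the gene labels are hypothetical; the 0.1% threshold is the example value given above.

```python
from typing import Iterable, List, NamedTuple


class Site(NamedTuple):
    gene: str             # gene that contains the site
    position: int         # hypothetical coordinate of the site
    population_af: float  # annotated mutation frequency in the population


def filter_sites(sites: Iterable[Site], max_af: float = 0.001) -> List[Site]:
    """Keep only sites whose population mutation frequency is below max_af."""
    return [site for site in sites if site.population_af < max_af]


case_sites = [
    Site("GENE_A", 100, 0.0002),
    Site("GENE_A", 180, 0.03),
    Site("GENE_B", 40, 0.0009),
]
control_sites = [
    Site("GENE_A", 100, 0.0002),
    Site("GENE_C", 7, 0.2),
]

# The same condition is applied to both sample sets.
target_sites = set(filter_sites(case_sites)) | set(filter_sites(control_sites))

# The K genes are the genes that contain at least one target mutation site.
k_genes = sorted({site.gene for site in target_sites})
print(k_genes)  # ['GENE_A', 'GENE_B']
```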

In some embodiments of the present application, the gene detection result of the case sample may be obtained by performing gene detection on the case sample.

The annotation information corresponding to the genes of the case sample may be obtained from public databases. Specifically, it can be obtained from the 1000 Genomes Project database, the Genome Aggregation Database (gnomAD), the National Center for Biotechnology Information (NCBI), the sequence database RefSeq, the non-synonymous mutation functional annotation database dbNSFP, the GERP++ mutation site function prediction database and the RepeatMasker repeat sequence detection database.

In some embodiments of the present application, the public databases contain a large amount of population mutation information and corresponding characteristic data (i.e., annotation information). The quality and reliability of each site can therefore be analyzed using characteristic information such as the mutations and frequencies observed in the very large populations of the public databases, and characteristic data (i.e., annotation information) can be added to each site.

In some embodiments of the application, the site identity condition may include, but is not limited to, filtering out at least one of the following:

non-coding region mutations; synonymous mutations; mutations with a population frequency greater than a first preset value; mutations predicted to be benign by deleteriousness prediction software; mutations in highly repetitive regions of the genome; sites predicted to be non-conserved; gene sites whose detection rate in the case samples is less than a second preset value; gene sites whose detection rate in the control samples is less than a third preset value; and gene sites that do not conform to Hardy-Weinberg equilibrium in the control samples.

The first preset value may be a preset threshold of the population frequency. For example, it may be 0.001.

The second preset value may be a preset threshold on the detection rate in the case samples; gene loci whose detection rate falls below it are filtered out. For example, it may be 95%.

The third preset value may be a preset threshold on the detection rate in the control samples; gene loci whose detection rate falls below it are filtered out. For example, it may be 60%.

In some embodiments of the present application, the gene loci in the control sample and the gene loci in the case sample are filtered under the same site identity conditions to obtain the target mutation sites, and the gene information containing the target mutation sites is obtained accordingly.

Correspondingly, the step 120 may specifically include:

based on the target mutation site, a first number of mutations in the allele of each gene comprising the target mutation in the case sample and a second number of mutations in the allele of each gene in the control sample are counted.

In one example, taking the control sample as an example, if there are 20,000 sites in the control sample and only 2,000 sites remain after site filtering, these 2,000 sites can be used as the target mutation sites. The number of mutations occurring in the alleles of each gene containing these 2,000 target mutation sites is then counted separately in the case sample and in the control sample.

In the embodiment of the application, the target mutation sites are obtained by filtering, according to the annotation information, the sites of each gene in the case sample and the sites of each gene in the control sample under the same site identity conditions. The number of allele mutations of each gene containing the target mutation sites is then counted separately in the case sample and in the control sample. Because only genes containing target mutation sites are counted, far fewer allele mutations need to be tallied, which improves the calculation speed and the efficiency of determining rare mutant genes that cause the onset of the preset disease.
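The filtering step described above can be sketched as follows. This is only an illustrative sketch, not the implementation of the application: the `Site` record, its field names and the helper functions are hypothetical, the thresholds are the example values given in the text (0.001, 95% and 60%), and the conservation and Hardy-Weinberg conditions are omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    # Hypothetical annotated site record; field names are assumptions.
    gene: str
    is_coding: bool
    is_synonymous: bool
    population_freq: float
    predicted_benign: bool
    in_repeat_region: bool
    call_rate_case: float     # detection rate in the case samples
    call_rate_control: float  # detection rate in the control samples


# Example thresholds quoted in the text (first, second, third preset values).
MAX_POP_FREQ = 0.001
MIN_CALL_RATE_CASE = 0.95
MIN_CALL_RATE_CONTROL = 0.60


def keep_site(site: Site) -> bool:
    """Return True if the site survives every exclusion condition."""
    return (
        site.is_coding
        and not site.is_synonymous
        and site.population_freq <= MAX_POP_FREQ
        and not site.predicted_benign
        and not site.in_repeat_region
        and site.call_rate_case >= MIN_CALL_RATE_CASE
        and site.call_rate_control >= MIN_CALL_RATE_CONTROL
    )


def target_sites(sites):
    """Apply the same conditions to every site, whatever its origin."""
    return [s for s in sites if keep_site(s)]


def genes_with_targets(sites):
    """The K genes: genes that contain at least one target mutation site."""
    return sorted({s.gene for s in target_sites(sites)})
```

Applying one `keep_site` predicate to both sample groups is what keeps the case and control counts comparable.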

Wherein the third number is the sum of the number of unmutated alleles in the case sample.

In some embodiments of the application, the number of unmutated alleles in the case sample may be determined from the total number of case samples and the first number. Specifically, the total number of case samples is multiplied by 2, and the first number is then subtracted to obtain the sum of the number of unmutated alleles in the case samples (i.e., the third number).

Specifically, the third number may be obtained by the following formula (1):

$$\mathrm{NUM}_{\mathrm{third}} = 2N - \mathrm{NUM}_{\mathrm{first}} \tag{1}$$

where $\mathrm{NUM}_{\mathrm{third}}$ is the third number, $\mathrm{NUM}_{\mathrm{first}}$ is the first number, and $N$ is the total number of case samples.

In some embodiments of the application, since one sample has two alleles, the number of mutations of a given sample in a gene generally does not exceed 2 for rare mutations, and a count exceeding 2 is recorded as 2. The total number of case samples multiplied by 2 is therefore taken as the total number of alleles.

Wherein the fourth number may be the sum of the number of unmutated alleles in the control sample.

In some embodiments of the present application, the fourth number of unmutated alleles in the control sample may be determined from the total number of control samples and the second number. Specifically, the total number of control samples is multiplied by 2, and the second number is then subtracted to obtain the number of unmutated alleles in the control sample (i.e., the fourth number).

Specifically, the fourth number may be obtained by the following formula (2):

$$\mathrm{NUM}_{\mathrm{fourth}} = 2M - \mathrm{NUM}_{\mathrm{second}} \tag{2}$$

where $\mathrm{NUM}_{\mathrm{fourth}}$ is the fourth number, $\mathrm{NUM}_{\mathrm{second}}$ is the second number, and $M$ is the total number of control samples.
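As a worked illustration of the two formulas, the sketch below computes the mutated and unmutated allele counts for one gene. The function name `allele_counts` and the input layout (one rare-mutation count per sample for the gene) are assumptions made for this example; the cap at 2 follows the rule stated in the text.

```python
def allele_counts(per_sample_mutations, total_samples):
    """Return (mutated, unmutated) allele counts for one gene.

    per_sample_mutations: rare-mutation counts for the gene, one entry per
        sample that was examined (samples with no mutation may be omitted).
    total_samples: N for the case group or M for the control group.
    """
    if len(per_sample_mutations) > total_samples:
        raise ValueError("more per-sample counts than samples")
    # A sample carries two alleles, so a count above 2 is recorded as 2.
    mutated = sum(min(count, 2) for count in per_sample_mutations)
    # Formulas (1) and (2): unmutated = 2 * samples - mutated.
    unmutated = 2 * total_samples - mutated
    return mutated, unmutated


# Four case samples with 1, 0, 3 and 2 rare mutations in the gene:
first_number, third_number = allele_counts([1, 0, 3, 2], total_samples=4)
# Six control samples, only one of which carries a single mutation:
second_number, fourth_number = allele_counts([1], total_samples=6)
```

With these inputs the case group gives 5 mutated and 3 unmutated alleles (the count of 3 is capped at 2), and the control group gives 1 mutated and 11 unmutated alleles.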

In some embodiments of the application, a significant difference verification may be performed on the first number, the second number, the third number, and the fourth number for each of the K genes in the case sample and the control sample, specifically to check whether the screened rare mutations in the case sample and the control sample differ significantly in load on each gene.

The verification result here may be a value characterizing whether the rare mutations on each gene differ significantly in load between the case samples and the control samples.

In some embodiments of the application, the types of allelic mutations include single nucleotide polymorphism (SNP) mutations and insertion-deletion (InDel) mutations.

In some embodiments of the application, rare mutant genes that lead to the onset of the preset disease can be obtained from the K genes based on the above verification result. Specifically, if the target verification result for a gene, that is, the value characterizing whether the rare mutations on that gene differ significantly in load between the case samples and the control samples, is less than 0.05, the gene is determined to be a rare mutant gene causing the onset of the preset disease.

In some embodiments of the application, the smaller the value characterizing the load difference of the rare mutations on a gene between the case samples and the control samples, the more strongly the gene is associated with the preset disease, and the more attention that rare mutant gene deserves.

In some embodiments of the present application, in order to further accurately identify rare mutant genes that lead to the onset of a predetermined disease, step 150 may specifically include:

performing a Fisher's exact test on the first number, the second number, the third number and the fourth number to obtain a first verification result;

correcting the first verification result to obtain a target verification result;

correspondingly, the step 160 may specifically include:

based on the target verification result, rare mutant genes causing the onset of the preset disease are selected from the K genes.

The first verification result may be the verification result obtained after performing a Fisher's exact test on the first number, the second number, the third number and the fourth number.

The target verification result may be a verification result obtained after correcting the first verification result.

In some embodiments of the present application, a Fisher's exact test is first performed on the first number, the second number, the third number, and the fourth number to obtain the first verification result; the first verification result may then be corrected by a method such as Bonferroni correction or false discovery rate (FDR) correction to obtain the target verification result.

It should be noted that the Fisher's exact test is only an example of the present application; other verification methods, such as the chi-square test, may also be used. Any other verification method, and likewise any other correction method, falls within the scope of the present application.

It should be noted that, for a gene that is truly related to the disease, the proportion of mutated alleles in the case samples is theoretically expected to be far greater than that in the control samples.

In the embodiment of the application, the first verification result is obtained by performing a Fisher's exact test on the first number, the second number, the third number and the fourth number, and the first verification result is corrected to obtain the target verification result. Rare mutant genes causing the onset of the preset disease can thus be screened out based on the corrected target verification result, which improves the accuracy of determining such genes.
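The test-then-correct procedure can be sketched with the standard library alone. The following is a minimal sketch under stated assumptions, not the implementation of the application: the function names are hypothetical, the test is taken to be two-sided (the text does not say whether a one-sided or two-sided test is used), and the FDR correction is assumed to be the Benjamini-Hochberg procedure.

```python
from math import comb


def fisher_exact_two_sided(first, second, third, fourth):
    """Two-sided Fisher's exact test on the 2x2 table
    [[first, third], [second, fourth]]
    (rows: case / control; columns: mutated / unmutated alleles)."""
    row1, row2 = first + third, second + fourth
    col1 = first + second
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x mutated alleles in the case row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = [prob(x) for x in range(lo, hi + 1)]
    p_obs = prob(first)
    # Sum every table that is no more likely than the observed one.
    p_value = sum(p for p in probs if p <= p_obs * (1 + 1e-7))
    return min(1.0, p_value)


def bonferroni(p_values):
    """Multiply each P value by the number of tests, capped at 1."""
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values (one common FDR correction)."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    adjusted = [0.0] * k
    running_min = 1.0
    for rank in range(k, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * k / rank)
        adjusted[i] = running_min
    return adjusted
```

In this sketch one P value is computed per gene from its four counts, and the list of K P values is then passed to either correction function to obtain the target verification results.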

In some embodiments of the application, the verification result is a P value, which may be used to characterize the significance of the difference in mutation load of the mutant gene between the case sample and the control sample, i.e., to characterize the significance of the association between the mutant gene and the preset disease.

Correspondingly, the step 160 may specifically include:

based on the P values of the K genes, the mutant gene corresponding to the P value less than or equal to the preset P value is determined as the rare mutant gene causing the onset of the preset disease.

The preset P value is a preset threshold value of the P value, and may specifically be 0.05.

In some embodiments of the application, among the P values corresponding to the mutant genes containing rare mutations, each mutant gene whose P value is less than or equal to the preset P value may be determined as a rare mutant gene that causes the onset of the preset disease.

In one example, suppose there are 3 mutant genes Q, W and E whose P values are 0.02, 0.06 and 0.03, respectively, and the preset P value is 0.05. The P values less than or equal to 0.05, namely 0.02 and 0.03, are selected, and the corresponding mutant genes Q and E are taken as the rare mutant genes causing the onset of the preset disease.
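The selection in this example can be written in a few lines. The function name `select_genes` and the dictionary layout are assumptions made for illustration only.

```python
def select_genes(p_values, preset_p=0.05):
    """Keep the genes whose P value is less than or equal to the preset P value."""
    return [gene for gene, p in p_values.items() if p <= preset_p]


selected = select_genes({"Q": 0.02, "W": 0.06, "E": 0.03})
print(selected)  # ['Q', 'E']
```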

In some embodiments of the present application, in order to enhance the user experience, after step 160, the information filtering method may further include:

outputting each mutant gene and the verification result corresponding to each mutant gene.

In the embodiment of the application, each mutant gene and the verification result corresponding to each mutant gene can be output in the form of a chart for a user to check, so that the user can intuitively check each mutant gene and the verification result corresponding to each mutant gene, and the user experience is improved.

In some embodiments of the present application, the first number, the second number, the third number, and the fourth number counted in steps 120-140 may also be stored. If other case samples are subsequently analyzed under the same site identity conditions, the stored statistics of the control samples can be reused directly, so that the control-sample statistics in steps 120-140 are skipped and computing resources and time are saved.

It should be noted that the above method can also be used to find new rare mutant genes that have not been found before, which lead to the onset of preset diseases.

In an embodiment of the present application, an analysis device implementing the above information screening method may be developed in a programming language (e.g., Python or C++). Using the rare mutation information in public databases, an association analysis tool is developed that takes rare mutations from public databases as the reference and operates on genes or custom regions, thereby filling the gap for such a tool in this technical direction.

Specifically, the case sample and the control sample obtained in step 110 may be input into the aforementioned developed analysis device, and the subsequent steps 120 to 160 are performed to screen out rare mutant genes that cause the onset of the preset disease.

After the analysis device for the information screening method has been developed, a front-end operation interface may be designed, so that data filtering, data statistics, testing of the case-control statistics, plotting and other processes are integrated into a software system written in a programming language, and a user can carry out the analysis simply by selecting parameters.

The inputs of the analysis device are the acquired case samples and control samples. The detection platform is not limited, and the analysis device is compatible with detection data from multiple platforms.

In addition, a public database file that has been counted once can be reused multiple times, which saves calculation cost (the statistics need to be recomputed when the filtering conditions change).
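One way to reuse control statistics while still recomputing them when the filtering conditions change is to key a cache on the conditions themselves. The sketch below is an assumption about how this could be done, not a description of the developed device; the class `ControlStatsCache`, its methods and the shape of the stored statistics are all hypothetical.

```python
import hashlib
import json


class ControlStatsCache:
    """Cache per-gene control statistics, keyed by the filtering conditions."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def key(filter_conditions):
        # Canonical JSON so that equal conditions always give the same key.
        canonical = json.dumps(filter_conditions, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def get_or_compute(self, filter_conditions, compute):
        """Return cached statistics, calling compute() only on a cache miss."""
        k = self.key(filter_conditions)
        if k not in self._store:
            self._store[k] = compute()
        return self._store[k]


cache = ControlStatsCache()
conditions = {"max_pop_freq": 0.001, "min_call_rate_control": 0.60}
# Hypothetical statistics: gene -> (second number, fourth number).
stats = cache.get_or_compute(conditions, lambda: {"G1": (1, 11)})
```

A second call with the same conditions returns the stored statistics without recounting, while any change to the conditions produces a new key and therefore a fresh count.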

It should be noted that, in the information screening method provided in the embodiment of the present application, the execution subject may be an information screening device, or a control module in the information screening device for executing the information screening method.

Based on the same inventive concept as the information screening method, the application also provides an information screening device. The information screening apparatus according to the embodiment of the present application is described in detail below with reference to fig. 2.

Fig. 2 is a schematic diagram showing a structure of an information screening apparatus according to an exemplary embodiment.

As shown in fig. 2, the information filtering apparatus 200 may include:

a first obtaining module 210, configured to obtain N case samples with a preset disease, and M control samples corresponding to the case samples, where N and M are positive integers;

a statistics module 220, configured to count, for each of K genes in the case samples and the control samples, a sum of numbers of mutations occurring in the allele in the N case samples to obtain a first number, and a sum of numbers of mutations occurring in the allele in each of the M control samples to obtain a second number, where K is a positive integer;

a first determining module 230 for determining, for each of the K genes in the case sample and the control sample, a sum of the number of unmutated alleles in the case sample based on a total sample number of the case sample and the first number, resulting in a third number;

a second determining module 240 for determining, for each of the K genes in the case sample and the control sample, a sum of the number of unmutated alleles in the control sample based on a total sample number of the control sample and the second number, resulting in a fourth number;

a verification module 250, configured to perform a significant difference verification on the first number, the second number, the third number, and the fourth number for each of the K genes in the case sample and the control sample, to obtain a verification result;

and a screening module 260, configured to screen rare mutant genes that cause the onset of the preset disease from the K genes based on the verification result.

In the embodiment of the application, the first acquisition module acquires N case samples with a preset disease and M control samples corresponding to the case samples. For each of the K genes in the case samples and the control samples, the statistics module counts the sum of the number of mutations of the allele in the N case samples to obtain a first number, and the sum of the number of mutations of the allele in the M control samples to obtain a second number. The first determining module and the second determining module then determine the sum of the number of unmutated alleles in the case sample to obtain a third number, and the sum of the number of unmutated alleles in the control sample to obtain a fourth number. The verification module performs a significant difference verification on the first number, the second number, the third number and the fourth number to obtain a verification result, and the screening module finally determines, according to the verification result, the rare mutant genes causing the onset of the preset disease. In this way, the rare mutation sites causing the onset of the preset disease and the corresponding rare mutant genes can be accurately screened out.

In some embodiments of the present application, in order to further improve the efficiency of determining rare mutant genes causing the onset of a preset disease, the information screening apparatus may further include:

the second acquisition module is used for acquiring the gene detection result of the case sample and annotation information corresponding to the gene of the case sample;

the filtering module is used for filtering the same conditions of the sites of the genes in the case sample and the sites of the genes in the control sample based on the annotation information to obtain target mutation sites; the K genes are genes containing the target mutation sites.

In some embodiments of the application, the site identity condition includes at least one of:

non-coding region mutation; synonymous mutation; mutation with a population frequency greater than a first preset value; mutation predicted to be benign by deleteriousness prediction software; mutation in a highly repetitive region of the genome; mutation predicted to be non-conserved by conservation prediction; gene locus whose detection rate in the case sample is less than a second preset value; gene locus whose detection rate in the control sample is less than a third preset value; and gene locus that does not conform to Hardy-Weinberg equilibrium in the control sample.

In some embodiments of the application, to further accurately identify rare mutant genes that lead to the onset of a predetermined disease, the verification module may be specifically configured to:

correspondingly, the screening module 260 may specifically be configured to:

and screening rare mutant genes causing the onset of the preset disease from the K genes based on the target verification result.

In some embodiments of the application, the verification result is a P value, the P value being used to characterize the difference in the proportion of the mutant gene between the case sample and the control sample, and thereby the significance of the association between the mutant gene and the preset disease;

the screening module 260 may be specifically configured to:

and determining the mutant genes corresponding to the P values which are smaller than or equal to a preset P value in the P values as rare mutant genes causing the onset of the preset disease based on the P values of the K genes.

In some embodiments of the present application, the first determining module 230 may specifically be configured to: determining the sum of the number of unmutated alleles in the case sample based on the following formula, resulting in a third number:

$$\mathrm{NUM}_{\mathrm{third}} = 2N - \mathrm{NUM}_{\mathrm{first}}$$

In some embodiments of the present application, the second determining module 240 may specifically be configured to: determining the sum of the number of unmutated alleles in the control sample based on the following formula, resulting in a fourth number:

$$\mathrm{NUM}_{\mathrm{fourth}} = 2M - \mathrm{NUM}_{\mathrm{second}}$$

where $\mathrm{NUM}_{\mathrm{fourth}}$ is the fourth number, $\mathrm{NUM}_{\mathrm{second}}$ is the second number, and $M$ is the total number of control samples.

In some embodiments of the present application, to save detection costs, the first acquisition module 210 may specifically be configured to:

obtaining M control samples corresponding to the case samples from a public database;

or alternatively;

and based on the case samples, taking the tested samples without the preset diseases as M control samples corresponding to the case samples.

In some embodiments of the present application, the second obtaining module may specifically be configured to:

annotation information corresponding to the genes of the case samples is obtained from a public database.

In some embodiments of the application, the public database comprises at least one of: the 1000 Genomes Project database, the Genome Aggregation Database (gnomAD), the sequence database RefSeq, the nonsynonymous mutation functional annotation database dbNSFP, the mutation site function prediction database GERP++ and the repeat sequence detection database RepeatMasker.

In some embodiments of the present application, in order to improve the user experience, the information filtering apparatus may further include:

and the output module is used for outputting each mutant gene and the verification result corresponding to each mutant gene.

The information screening device provided by the embodiment of the application can be used for executing the information screening method provided by the embodiments of the method, and the implementation principle and the technical effect are similar, so that the description is omitted for the sake of brevity.

Based on the same inventive concept, the embodiment of the application also provides electronic equipment.

Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 3, the electronic device may include a processor 301 and a memory 302 storing computer programs or instructions.

In particular, the processor 301 may include a Central Processing Unit (CPU), or an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or may be configured as one or more integrated circuits that implement embodiments of the present application.

Memory 302 may include mass storage for data or instructions. By way of example, and not limitation, memory 302 may comprise a Hard Disk Drive (HDD), a floppy disk drive, flash memory, an optical disk, a magneto-optical disk, magnetic tape, or a Universal Serial Bus (USB) drive, or a combination of two or more of the foregoing. Memory 302 may include removable or non-removable (or fixed) media, where appropriate. Memory 302 may be internal or external to the electronic device, where appropriate. In a particular embodiment, the memory 302 is a non-volatile solid-state memory. The memory may include Read-Only Memory (ROM), Random-Access Memory (RAM), magnetic disk storage media devices, optical storage media devices, flash memory devices, and electrical, optical, or other physical/tangible memory storage devices. Thus, in general, the memory includes one or more tangible (non-transitory) computer-readable storage media (e.g., memory devices) encoded with software comprising computer-executable instructions, and when the software is executed (e.g., by one or more processors) it is operable to perform the operations described in the information screening methods provided by the above embodiments.

The processor 301 implements any of the information screening methods of the above embodiments by reading and executing computer program instructions stored in the memory 302.

In one example, the electronic device may also include a communication interface 303 and a bus 310. As shown in fig. 3, the processor 301, the memory 302, and the communication interface 303 are connected to each other by a bus 310 and perform communication with each other.

The communication interface 303 is mainly used for implementing communication among the modules, apparatuses, units and/or devices in the embodiment of the present invention.

Bus 310 includes hardware, software, or both, that couple components of the electronic device to one another. By way of example, and not limitation, the buses may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an infiniband interconnect, a Low Pin Count (LPC) bus, a memory bus, a micro channel architecture (MCa) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCI-X) bus, a Serial Advanced Technology Attachment (SATA) bus, a video electronics standards association local (VLB) bus, or other suitable bus, or a combination of two or more of the above. Bus 310 may include one or more buses, where appropriate. Although embodiments of the invention have been described and illustrated with respect to a particular bus, the invention contemplates any suitable bus or interconnect.

The electronic device may execute the information screening method in the embodiment of the present invention, thereby implementing the information screening method described in fig. 1.

In addition, in combination with the information screening method in the above embodiment, the embodiment of the present invention may be implemented by providing a readable storage medium. The readable storage medium has program instructions stored thereon; the program instructions, when executed by a processor, implement any of the information screening methods of the above embodiments.

It should be understood that the invention is not limited to the particular arrangements and instrumentality described above and shown in the drawings. For the sake of brevity, a detailed description of known methods is omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present invention are not limited to the specific steps described and shown, and those skilled in the art can make various changes, modifications and additions, or change the order between steps, after appreciating the spirit of the present invention.

The functional blocks shown in the above-described structural block diagrams may be implemented in hardware, software, firmware, or a combination thereof. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, a plug-in, a function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine readable medium or transmitted over transmission media or communication links by a data signal carried in a carrier wave. A "machine-readable medium" may include any medium that can store or transfer information. Examples of machine-readable media include electronic circuitry, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, radio Frequency (RF) links, and the like. The code segments may be downloaded via computer networks such as the internet, intranets, etc.

It should also be noted that the exemplary embodiments mentioned in this disclosure describe some methods or systems based on a series of steps or devices. However, the present application is not limited to the order of the above-described steps, that is, the steps may be performed in the order mentioned in the embodiments, or may be performed in a different order from the order in the embodiments, or several steps may be performed simultaneously.

Aspects of the present application are described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, enable the implementation of the functions/acts specified in the flowchart and/or block diagram block or blocks. Such a processor may be, but is not limited to being, a general purpose processor, a special purpose processor, an application specific processor, or a field programmable logic circuit. It will also be understood that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware which performs the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In the foregoing, only the specific embodiments of the present invention are described, and it will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems, modules and units described above may refer to the corresponding processes in the foregoing method embodiments, which are not repeated herein. It should be understood that the scope of the present invention is not limited thereto, and any equivalent modifications or substitutions can be easily made by those skilled in the art within the technical scope of the present invention, and they should be included in the scope of the present invention.

Claims

1. An information screening method, the method comprising:

2. The method of claim 1, wherein after the obtaining N case samples with a preset disease and M control samples corresponding to the case samples, the method further comprises:

obtaining a gene detection result of the case sample and annotation information corresponding to the gene of the case sample;

based on the annotation information, respectively carrying out site identical condition filtering on the sites of the genes in the case sample and the sites of the genes in the control sample to obtain target mutation sites;

The K genes are genes containing the target mutation sites.

3. The method of claim 2, wherein the site identity condition comprises at least one of:

4. The method of claim 1, wherein performing a significant difference check on the first number, the second number, the third number, and the fourth number to obtain a check result comprises:

the screening of rare mutant genes causing the onset of the preset disease from the K genes based on the verification result comprises the following steps:

5. The method of any one of claims 1-4, wherein the verification result is a P value, the P value being used to characterize the difference in the ratio of the mutant gene in the case sample to the control sample, and thereby to characterize the significance of the mutant gene in relation to the predetermined disease;

6. The method of claim 1, wherein the determining the sum of the number of unmutated alleles in the case sample based on the total sample number of the case sample and the first number, results in a third number, comprising:

determining a sum of the number of unmutated alleles in the case sample based on the total sample number of the case sample and the first number based on the following formula, resulting in a third number:

$$\mathrm{NUM}_{\mathrm{third}} = 2N - \mathrm{NUM}_{\mathrm{first}}$$

7. The method of claim 1, wherein the determining the sum of the number of unmutated alleles in the control sample based on the total sample number and the second number of control samples, resulting in a fourth number, comprises:

determining the sum of the number of unmutated alleles in the control sample based on the total sample number and the second number of control samples based on the following formula, resulting in a fourth number:

$$\mathrm{NUM}_{\mathrm{fourth}} = 2M - \mathrm{NUM}_{\mathrm{second}}$$

8. The method of claim 1, wherein obtaining M control samples corresponding to the case samples comprises:

or alternatively;

9. The method of claim 2, wherein obtaining annotation information corresponding to a gene of the case sample comprises:

10. The method according to claim 8 or 9, wherein the public database comprises at least one of: the 1000 Genomes Project database, the Genome Aggregation Database (gnomAD), the sequence database RefSeq, the nonsynonymous mutation functional annotation database dbNSFP, the mutation site function prediction database GERP++ and the repeat sequence detection database RepeatMasker.

11. The method according to any one of claims 1-4, wherein after said obtaining a verification result, the method further comprises:

12. The method of claim 1, wherein the types of allelic mutations include single nucleotide polymorphism (SNP) mutations and insertion-deletion (InDel) mutations.

13. An information screening apparatus, the apparatus comprising:

the statistics module is used for counting the sum of the mutation numbers of the allele in the N case samples to obtain a first number and the sum of the mutation numbers of the allele in the M control samples to obtain a second number for each gene in the K genes in the case samples and the control samples, wherein K is a positive integer;

14. An electronic device, the electronic device comprising: a processor and a memory storing computer program instructions; the processor, when executing the computer program instructions, implements the information screening method of any one of claims 1-12.

15. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon computer program instructions, which when executed by a processor, implement the information screening method according to any of claims 1-12.