CN116935961A - Source detection method and device for copy number repeat variation - Google Patents
- Publication number
- CN116935961A CN202310851930.7A
- Authority
- CN
- China
- Prior art keywords
- mutation
- data
- family
- point mutation
- point
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/50—Mutagenesis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/10—Pre-processing; Data cleansing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Abstract
The application discloses a source detection method and device for copy number repeat variation. The method comprises the following steps: after obtaining mutation data about a copy number repeat variation and obtaining family point mutation data corresponding to the mutation data, determining a previous generation mutation type corresponding to the family point mutation data and a current generation mutation type corresponding to the mutation data; calculating, according to the previous generation mutation type and the current generation mutation type, a point mutation density distribution proportion value of the mutation data relative to the family point mutation data; and determining the family source of the mutation data based on the point mutation density distribution proportion value. Because the variation source is determined from a density distribution proportion value computed according to the mutation type of the family point mutation data and the mutation type of the repeat variation data, the method conforms to genetic principles and improves detection precision and accuracy.
Description
Technical Field
The application relates to the technical field of chromosome detection, in particular to a source detection method and device for copy number repeated variation.
Background
With the development of technology, gene detection is becoming routine and its application scenarios are broadening. One common technique for detecting abnormal genes is detection of the source of a copy number repeat variation: the gene or whole genome is scanned to find the DNA sequence of the repeat variation, a biological phenotype is determined from that DNA sequence, and the phenotype is matched against the phenotype of the male parent or the female parent to determine the source of the copy number repeat variation in the chromosome. However, a phenotype may or may not be related to the copy number repeat variation, and phenotypic differences may also be caused by environmental factors. Matching on phenotype alone can therefore give a result that differs greatly from the actual source, so detection accuracy is low.
Disclosure of Invention
The application provides a source detection method and device for copy number repeat variation. The method obtains copy number repeat variation data, determines the related family point mutation data, and determines the variation source from the density distribution proportion of the family point mutation data and the repeat variation data, so that it conforms to genetic principles and improves detection precision and accuracy.
A first aspect of an embodiment of the present application provides a method for detecting a source of duplicate variation in copy number, the method comprising:
after obtaining mutation data about repeated mutation of the copy number and obtaining family point mutation data corresponding to the mutation data, determining a previous generation mutation type corresponding to the family point mutation data and determining a current generation mutation type corresponding to the mutation data;
calculating a point mutation density distribution proportion value of the mutation data to the family point mutation data according to the previous generation mutation type and the current generation mutation type;
determining the family source of the mutation data based on the point mutation density distribution ratio value.
In a possible implementation manner of the first aspect, the point mutation density distribution proportion value includes: a first distribution ratio value;
the calculating the point mutation density distribution proportion value of the mutation data to the family point mutation data according to the previous generation mutation type and the current generation mutation type comprises the following steps:
if the previous generation mutation type is a homozygous mutation of a male parent and a wild mutation of a female parent, and the current generation mutation type is a heterozygous mutation, extracting a first point mutation proportion array from the mutation data;
and converting the first point mutation proportion array into a first density distribution curve by using a preset kernel density estimation function, fitting the first density distribution curve by using a preset Gaussian distribution kernel function, and obtaining a distribution proportion value corresponding to the highest point of the first density distribution curve to obtain a first distribution proportion value.
In a possible implementation manner of the first aspect, the point mutation density distribution ratio value further includes: a second distribution ratio value;
the calculating the point mutation density distribution proportion value of the mutation data to the family point mutation data according to the previous generation mutation type and the current generation mutation type comprises the following steps:
if the previous generation mutation type is a wild mutation of a male parent and a homozygous mutation of a female parent, and the current generation mutation type is a heterozygous mutation, extracting a second point mutation proportion array from the mutation data;
and converting the second point mutation proportion array into a second density distribution curve by using a preset kernel density estimation function, fitting the second density distribution curve by using a preset Gaussian distribution kernel function, and obtaining a distribution proportion value corresponding to the highest point of the second density distribution curve to obtain a second distribution proportion value.
In a possible implementation manner of the first aspect, the determining the source of the mutation data based on the point mutation density distribution ratio value includes:
if the first distribution proportion value is larger than a first preset value and the second distribution proportion value is smaller than a second preset value, determining that the family source of the variation data is a male parent;
if the first distribution proportion value is smaller than a second preset value and the second distribution proportion value is larger than the first preset value, determining that the family source of the variation data is a female parent.
In a possible implementation manner of the first aspect, the determining the source of the mutation data based on the point mutation density distribution ratio value further includes:
if neither the first distribution ratio value nor the second distribution ratio value satisfies a preset value, determining the family source of the variation data.
In a possible implementation manner of the first aspect, the preset kernel density estimation function is as follows:
the preset Gaussian distribution kernel function is as follows:
In a possible implementation manner of the first aspect, the obtaining family point mutation data corresponding to the mutation data includes:
acquiring a plurality of family sequencing original data corresponding to the variation data, and preprocessing each family sequencing original data to obtain family processing data, wherein the preprocessing comprises the following steps: data cleaning, data quality control, data comparison, mutation detection and data filtering, wherein each family sequencing original data is the gene data of the family member corresponding to the mutation data;
combining the point mutation data contained in the family processing data, and extracting a union set of the combined data to obtain a point mutation data set;
and carrying out data screening on the point mutation data set according to the mutation fragments corresponding to the mutation data to obtain family point mutation data.
A second aspect of embodiments of the present application provides a source detection device for copy number repeat variation, the device comprising:
the acquisition and determination module is used for determining the previous generation mutation type corresponding to the family point mutation data and determining the current generation mutation type corresponding to the mutation data after acquiring the mutation data related to the repeated mutation of the copy number and acquiring the family point mutation data corresponding to the mutation data;
the distribution proportion value calculating module is used for calculating the point mutation density distribution proportion value of the mutation data accounting for the family point mutation data according to the previous generation mutation type and the current generation mutation type;
and a determining family source module for determining family sources of the mutation data based on the point mutation density distribution ratio value.
Compared with the prior art, the source detection method and device for copy number repeat variation provided by the embodiments of the application have the following beneficial effects: the application obtains copy number repeat variation data and determines the related family point mutation data, calculates the density distribution proportion value of the point mutations according to the mutation type of the family point mutation data and the mutation type of the repeat variation data, and determines the variation source based on that density distribution proportion value, so that it conforms to genetic principles and improves detection precision and accuracy.
Drawings
FIG. 1 is a flow chart of a method for detecting duplicate copy number variation according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a method for detecting a source of copy number repeat variation according to an embodiment of the present application;
FIG. 3 is a schematic diagram of copy number results provided by an embodiment of the present application;
FIG. 4 is a graph showing the density distribution of the proband's point mutation proportions for the two combinations in the three-copy case according to one embodiment of the present application;
FIG. 5 is a schematic diagram of a source detection device for copy number repeat variation according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In order to solve the above problems, a source detection method for copy number repeat variation provided in the present embodiments will be described and illustrated in detail by the following specific examples.
Referring to fig. 1, a flow chart of a method for detecting a source of copy number duplication variation according to an embodiment of the application is shown.
In one embodiment, the method is applicable to a computer system, and the abnormal gene data can be input into the computer system, and the computer system is used for detecting and analyzing the abnormal gene data or gene fragments to determine the specific source of the abnormal gene data.
The method for detecting the source of the copy number repetitive variation may include:
S11, after obtaining mutation data about repeated mutation of the copy number and obtaining family point mutation data corresponding to the mutation data, determining a previous generation mutation type corresponding to the family point mutation data and determining a current generation mutation type corresponding to the mutation data.
Repeat variation is a common type of copy number variation, and its most common manifestation in the human genome is three copies of a chromosome segment. Detecting the source of the repeat variation data allows subsequent gene detection to be carried out effectively according to that source.
In one embodiment, the mutation data about the chromosome copy number repeat variation may include the variant fragment to be detected (specifically, a mutated gene fragment) and the mutation type (for example heterozygous or homozygous). The family point mutation data may be the data of the variant sites carried by each family member corresponding to the mutation data. The family members may be direct relatives.
In an embodiment, determining the previous generation mutation type corresponding to the family point mutation data may mean determining the genotype of the family members in the family point mutation data at the copy number repeat variation;
determining the current generation mutation type corresponding to the mutation data may specifically mean determining the genotype of the mutation data in which the copy number repeat variation occurs.
The above types may be homozygous mutations or wild-type mutations.
Because the mutation data may correspond to many family members, collecting the gene data of all of them for detection would mean a large amount of data to process and a long run time.
As an example, step S11 may include the following sub-steps:
S111, acquiring a plurality of family sequencing original data corresponding to the variation data, and preprocessing each family sequencing original data to obtain family processing data, wherein the preprocessing comprises the following steps: data cleaning, data quality control, data comparison, mutation detection and data filtering, wherein each family sequencing original data is gene data of a family member corresponding to the mutation data.
In an alternative embodiment, family sequencing raw data for a number of family members may be obtained, which may be genetic data for the family members.
In genetic inheritance, genes from more distant generations have less influence. To further reduce the amount of data to process, in a preferred embodiment only the family sequencing original data of the parents corresponding to the mutation data are obtained (namely the family sequencing original data of the male parent and of the female parent). The variation source is judged from the variant fragment and from the differences in point mutation information among the three family members (the proband, the male parent and the female parent). This retains the inheritance characteristics of chromosomes and the wide distribution of point mutations, and relies on the fixed patterns exhibited by different copy number variation types to complete the judgment, which improves detection accuracy.
In an embodiment, after the family sequencing original data of the male parent and of the female parent are obtained, data cleaning, data quality control, sequencing data comparison (alignment), mutation detection, mutation information filtering, mutation data annotation and similar steps may be performed in sequence on each of the two sets of original data to form the family processing data.
Alternatively, the processed family sequencing raw data may be converted into a family point variation information summary table containing variation information of all members of the family for subsequent processing.
In one embodiment, all the point mutation information of the family members can be recorded in the family point mutation information summary table. For each point mutation this includes: the mutation itself (the bases before and after mutation), the quality value, the zygosity, the count information (the number of sequencing fragments carrying the mutation at the site and the number of sequencing fragments covering the site), and the mutation proportion (the ratio of the former count to the latter). This information is required for the subsequent judgment of the source of the copy number variation.
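One row of such a summary table could be modelled as below. This is a minimal sketch: the class name `PointMutation`, its field names and the zygosity labels are hypothetical and are not taken from the application; only the listed kinds of information and the definition of the mutation proportion come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PointMutation:
    """One row of a hypothetical family point mutation summary table."""
    member: str        # e.g. "proband", "father", "mother"
    chrom: str         # chromosome name
    pos: int           # position of the site
    ref: str           # base before mutation
    alt: str           # base after mutation
    quality: float     # quality value of the point mutation
    zygosity: str      # "hom", "het" or "wild" (labels are assumptions)
    alt_reads: int     # sequencing fragments carrying the mutation
    total_reads: int   # sequencing fragments covering the site

    @property
    def ratio(self) -> float:
        """Proportion of covering fragments that carry the mutation."""
        if self.total_reads == 0:
            raise ValueError("site has no coverage")
        return self.alt_reads / self.total_reads


row = PointMutation("proband", "chr1", 12345, "A", "G", 60.0, "het", 20, 30)
print(round(row.ratio, 3))
```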
S112, merging the point mutation data contained in the family processing data, and extracting a union set of the merged data to obtain a point mutation data set.
In one embodiment, after preprocessing is completed, the point mutation data are extracted from the family processing data of the male parent and from the family processing data of the female parent, and the two sets of point mutation data are combined into one data set. The point mutation data that are the same in the two sets are then extracted within this data set, that is, the union of the two is taken, to form the point mutation data set.
S113, carrying out data screening on the point mutation data set according to the mutation fragments corresponding to the mutation data to obtain the family point mutation data.
To further reduce the amount of data to process, the variant fragment contained in the mutation data can be obtained, and the point mutation data set can then be screened for the point mutations lying on that fragment; the records obtained by screening together form the family point mutation data.
In yet another alternative embodiment, the point mutation data set includes point mutation information owned by both the male parent and the female parent, and after all the point mutation information is obtained, the point mutations whose quality values do not satisfy the minimum threshold value may be filtered, so as to reduce the number of unnecessary or useless point mutation information.
Specifically, the user can preset the segment to be judged, then analyze the position of the preset segment, and screen out all the point mutation information of the family members in the segment according to the position. And after screening to obtain the point mutation information of all members of the family on the current fragment, starting to judge the mutation sources of different types.
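The merging and screening of steps S112 and S113 can be sketched as follows. This is an illustration under stated assumptions, not the application's implementation: the function name `merge_and_screen`, the dictionary keys and the rule that a site is dropped when any of its calls falls below the quality threshold are all hypothetical.

```python
def merge_and_screen(father_sites, mother_sites, chrom, start, end, min_quality):
    """Union the parents' site records, then keep those inside one segment.

    Each record is a dict with the keys chrom, pos, ref, alt and quality.
    """
    merged = {}
    for member, sites in (("father", father_sites), ("mother", mother_sites)):
        for site in sites:
            key = (site["chrom"], site["pos"], site["ref"], site["alt"])
            merged.setdefault(key, {})[member] = site

    kept = {}
    for key, calls in merged.items():
        site_chrom, pos = key[0], key[1]
        in_segment = site_chrom == chrom and start <= pos <= end
        # Assumed rule: every call at the site must meet the quality threshold.
        good = all(call["quality"] >= min_quality for call in calls.values())
        if in_segment and good:
            kept[key] = calls
    return kept


father = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "G", "quality": 50},
    {"chrom": "chr1", "pos": 900, "ref": "C", "alt": "T", "quality": 50},
]
mother = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "G", "quality": 10},
    {"chrom": "chr1", "pos": 200, "ref": "T", "alt": "C", "quality": 45},
]
result = merge_and_screen(father, mother, "chr1", 50, 500, min_quality=30)
print(sorted(key[1] for key in result))
```

In this toy input, position 900 lies outside the segment and position 100 fails the quality threshold in one parent, so only position 200 remains.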
In yet another alternative embodiment, when screening point mutations, the point mutations for which all family members have the same mutation type (i.e., the point mutation is identical in all members and all are homozygous, all wild-type or all heterozygous) may be deleted, so that the result shows a clearer tendency.
In addition, in some cases, it is also possible to find new chromosomal copy number variation segments by comprehensively analyzing all the point mutation information contained on a specific chromosomal segment of all members of the family.
S12, calculating a point mutation density distribution proportion value of the mutation data to the family point mutation data according to the previous generation mutation type and the current generation mutation type.
Because the family point mutation data comprises the point mutation data of the male parent and the point mutation data of the female parent, the previous generation mutation type corresponding to the family point mutation data can comprise the mutation type of the male parent or the mutation type of the female parent.
After the previous generation mutation type and the current generation mutation type are determined, the current generation mutation type and the previous generation mutation type can be combined, and the point mutation density distribution proportion value of the point mutation quantity contained in the mutation data and the point mutation quantity contained in the family point mutation data is calculated, so that the mutation source can be determined according to the distribution proportion value.
In an alternative embodiment, the point mutation density distribution ratio value includes: a first distribution ratio value;
as an example, step S12 may include the following sub-steps:
S121, if the previous generation mutation type is a homozygous mutation of the male parent and a wild mutation of the female parent, and the current generation mutation type is a heterozygous mutation, extracting a first point mutation proportion array from the mutation data.
S122, converting the first point mutation proportion array into a first density distribution curve by using a preset kernel density estimation function, fitting the first density distribution curve by using a preset Gaussian distribution kernel function, and obtaining a distribution proportion value corresponding to the highest point of the first density distribution curve to obtain a first distribution proportion value.
Most three-copy repeats of a chromosome segment arise because one of the parents contributes an extra copy of the chromosome, so the core of source determination is to identify where the extra chromosome copy came from in the subject of the mutation data to be detected (also called the proband).
Because the proband carries three copies of the chromosome segment and each parent has passed on at least one of them, the source cannot be judged by starting from the parents. Instead, one must start from the proband: examine the point mutation combinations across the proband's three copies, then relate them to the parents' mutation combinations at the same sites to determine the source of the extra copy.
Starting from the proband, at sites where the proband is a homozygous or wild-type mutation it is impossible to determine which parent the extra copy came from, whatever the point mutation types of the parents.
For example, when the proband is homozygous, each parent must be homozygous or heterozygous. When both parents are homozygous, no judgment is possible; when both parents are heterozygous, no judgment is possible either; when one is heterozygous and the other homozygous, the homozygous parent may have passed two copies to the proband, but the heterozygous parent may equally have passed two identical mutant copies to the proband through recombination at meiosis. Hence, when the proband is a homozygous or wild-type mutation, the source of the variation cannot be determined.
At sites where the proband's point mutation is heterozygous, each site must carry either two mutant bases and one wild-type base, or one mutant base and two wild-type bases. The proportion of mutant bases among all bases at such a site should therefore be 1/3 or 2/3, so the ratio of the number of sequencing fragments containing the mutation to the number of sequencing fragments covering the site should be close to 1/3 or 2/3, and this ratio can be used to determine the source of the extra copy. Among such heterozygous sites, if either parent is heterozygous, it cannot be determined which parent passed the point mutation to the proband; if the sites are restricted to those where one parent is a homozygous mutation and the other a wild-type mutation, the source can be judged from the proband's mutation ratio in the two cases.
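The 1/3 and 2/3 ratios can be checked with a short calculation. The sketch below assumes that one parent contributes two copies of the segment and the other contributes one; the function name `expected_mutant_fraction` and the genotype labels are hypothetical.

```python
from fractions import Fraction


def expected_mutant_fraction(father_gt, mother_gt, extra_from):
    """Expected mutant-base fraction at a site in a three-copy proband.

    father_gt / mother_gt: "hom" (every copy mutant) or "wild" (no copy mutant).
    extra_from: "father" or "mother", the parent assumed to pass on two copies.
    """
    mutant_per_copy = {"hom": 1, "wild": 0}
    copies = {"father": 1, "mother": 1}
    copies[extra_from] += 1  # the extra copy brings the total to three
    mutant = (copies["father"] * mutant_per_copy[father_gt]
              + copies["mother"] * mutant_per_copy[mother_gt])
    return Fraction(mutant, 3)


# Male parent homozygous mutant, female parent wild-type:
for origin in ("father", "mother"):
    print(origin, expected_mutant_fraction("hom", "wild", origin))
```

With a homozygous male parent and a wild-type female parent, an extra paternal copy gives 2/3 and an extra maternal copy gives 1/3; swapping the parental genotypes swaps the two values.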
Specifically, for sites where the parents are respectively a homozygous mutation and a wild-type mutation, all heterozygous point mutations are extracted from the proband's mutation data, and the mutation ratios of these point mutations form the first point mutation proportion array.
The density distribution of the mutation ratios is then computed with a preset kernel density estimation function (Kernel Density Estimation), and the mutation ratio at the highest point of the density distribution is obtained. Kernel density estimation works by using a kernel function to convert what would otherwise be a frequency histogram into a smooth density curve, which gives the first density distribution curve.
The Gaussian kernel, the kernel most commonly used for density estimation, can be used here. The mutation ratio at the highest point of the first density distribution curve fitted by the kernel density estimation function is where the mutation ratios are mainly distributed under the current screening condition, that is, the first distribution proportion value. The source of the repeated chromosome can then be judged from this value according to the decision logic.
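A minimal sketch of this step is given below, using a standard Gaussian kernel density estimate evaluated on a grid over [0, 1]. The bandwidth `h=0.05`, the grid size and the function names are illustrative assumptions; the application does not state these values in the text here.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, samples, h):
    """Kernel density estimate at x with bandwidth h."""
    total = sum(gaussian_kernel((x - xi) / h) for xi in samples)
    return total / (len(samples) * h)


def peak_ratio(samples, h=0.05, grid_size=1001):
    """Mutation ratio in [0, 1] at which the estimated density is highest."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=lambda x: kde(x, samples, h))


# Toy proportion array: most sites near 2/3, plus one outlier.
ratios = [0.62, 0.65, 0.66, 0.67, 0.68, 0.70, 0.71, 0.35]
peak = peak_ratio(ratios)
print(round(peak, 2))
```

Taking the peak of the density rather than the mean of the array keeps a few outlying sites from pulling the distribution proportion value away from the main cluster.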
In an alternative embodiment, the point mutation density distribution ratio value includes: a second distribution ratio value;
as an example, step S12 may include the following sub-steps:
S123, if the previous generation mutation type is a wild mutation of the male parent and a homozygous mutation of the female parent, and the current generation mutation type is a heterozygous mutation, extracting a second point mutation proportion array from the mutation data.
S124, converting the second point mutation proportion array into a second density distribution curve by using a preset kernel density estimation function, fitting the second density distribution curve by using a preset Gaussian distribution kernel function, and obtaining a distribution proportion value corresponding to the highest point of the second density distribution curve to obtain a second distribution proportion value.
In this embodiment, steps S123-S124 are analogous to steps S121-S122; refer to the analysis above.
The operation is as follows: for sites where the male parent is a wild-type mutation and the female parent is a homozygous mutation, all heterozygous point mutations are extracted from the proband's mutation data, and the mutation ratios of these point mutations form the second point mutation proportion array.
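The two proportion arrays can be collected in one pass over the screened sites. The sketch below is illustrative only: the function name `ratio_arrays`, the dictionary keys and the zygosity labels are hypothetical.

```python
def ratio_arrays(sites):
    """Split the proband's mutation ratios by parental combination.

    Each site is a dict giving the zygosity of father, mother and proband
    ("hom", "het" or "wild") and the proband's mutation ratio.
    """
    first, second = [], []
    for site in sites:
        if site["proband"] != "het":
            continue  # homozygous or wild-type proband sites are uninformative
        combo = (site["father"], site["mother"])
        if combo == ("hom", "wild"):
            first.append(site["ratio"])    # first point mutation proportion array
        elif combo == ("wild", "hom"):
            second.append(site["ratio"])   # second point mutation proportion array
    return first, second


sites = [
    {"father": "hom", "mother": "wild", "proband": "het", "ratio": 0.66},
    {"father": "wild", "mother": "hom", "proband": "het", "ratio": 0.34},
    {"father": "het", "mother": "wild", "proband": "het", "ratio": 0.50},
    {"father": "hom", "mother": "wild", "proband": "hom", "ratio": 1.00},
]
first, second = ratio_arrays(sites)
print(first, second)
```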
Solving the density distribution of the mutation proportion by utilizing a preset nuclear density estimation function (Kernel Density Estimation), and obtaining the mutation proportion of the highest point in the density distribution. The preset kernel density estimation function principle is to convert a histogram similar to frequency distribution into a smooth density distribution diagram through a kernel function, so as to obtain a second density distribution curve.
Then, the most common gaussian kernel for density estimation can be used. The highest mutation density distribution value obtained by fitting the kernel density estimation function to the second density distribution curve is the range of the main distribution of the mutation proportion under the current screening condition, namely the second distribution proportion value. Further, the source of the repeated chromosome can be judged according to the judgment logic through the value.
In one embodiment, the preset kernel density estimation functions used in steps S121-S122 and steps S123-S124 are as follows:
the kernel function K used in the above formula, that is, the preset kernel functions used in steps S121 to S122 and steps S123 to S124, is a gaussian distribution kernel function, as described in the following formula:
In the above formulas, the value p returned by the function is the probability density at a given value x, and h is the chosen bandwidth: the larger the bandwidth, the smoother the curve. n is the number of points falling within the bandwidth; by the definition of the kernel density estimation function, every Xi that falls within the bandwidth range of x contributes to the density value at that point.
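The formula images are not reproduced in this text. For orientation only, the standard textbook forms of a kernel density estimator and a Gaussian kernel, written with the symbols p, x, h, n and Xi described above, are given below. This is an assumption about the general shape of the expressions, not a transcription of the patent's own formulas.

```latex
% Standard kernel density estimator (textbook form, assumed)
p(x) = \frac{1}{n h} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right)

% Standard Gaussian kernel (textbook form, assumed)
K(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{u^{2}}{2}\right)
```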
S13, determining the family source of the mutation data based on the point mutation density distribution proportion value.
In an embodiment, after the first distribution ratio value and the second distribution ratio value are calculated, it may be determined whether the family source of the variation data is the father or the mother according to the values of the first distribution ratio value and the second distribution ratio value.
In one embodiment, step S13 may include the sub-steps of:
S131, if the first distribution proportion value is larger than a first preset value and the second distribution proportion value is smaller than a second preset value, determining that the family source of the variation data is a male parent.
S132, if the first distribution ratio value is smaller than a second preset value and the second distribution ratio value is larger than the first preset value, determining that the family source of the variation data is a female parent.
S133, if the first distribution ratio value and the second distribution ratio value satisfy neither of the above conditions, determining that the family source of the variation data cannot be determined.
Specifically, the first and second distribution ratio values can be interpreted as follows: because the parent from whom the duplicated chromosome originates contributes one extra chromosome segment, the heterozygous point mutation ratios of the proband shift markedly.
Among the sites where the proband is heterozygous, if one parent contributed the extra chromosome copy, then at sites where that parent's homozygous mutation is combined with the wild-type allele from the other parent, the proband's mutation ratio should be approximately 2/3; at sites where that parent's wild-type allele is combined with the homozygous mutation from the other parent, the proband's mutation ratio should be approximately 1/3.
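The 2/3 and 1/3 figures follow from counting alleles on a three-copy segment. The short sketch below makes that arithmetic explicit; the function name and the genotype labels `"hom"` and `"wt"` are illustrative assumptions, not terms from the patent.

```python
from fractions import Fraction


def expected_variant_fraction(dup_parent_genotype, other_parent_genotype):
    """Expected mutation ratio in the proband on a three-copy segment.

    Assumption: the parent that is the source of the duplication contributes
    two copies of the segment and the other parent contributes one.
    Genotypes are "hom" (homozygous mutant) or "wt" (wild type).
    """
    mutant_alleles_per_copy = {"hom": 1, "wt": 0}
    mutant_copies = (
        2 * mutant_alleles_per_copy[dup_parent_genotype]
        + mutant_alleles_per_copy[other_parent_genotype]
    )
    return Fraction(mutant_copies, 3)


# Duplication source is homozygous mutant, other parent is wild type.
print(expected_variant_fraction("hom", "wt"))
# Duplication source is wild type, other parent is homozygous mutant.
print(expected_variant_fraction("wt", "hom"))
```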
In practice, owing to various influences, the density maxima fall near 1/3 and 2/3 rather than exactly on them. The density maxima d1 and d2 obtained for the two combinations are therefore compared with a first preset value (approximately 0.6, corresponding to 2/3) and a second preset value (approximately 0.4, corresponding to 1/3). If one of the two distribution ratio values is greater than the first preset value (0.6) and the other is less than the second preset value (0.4), the genetic source of the extra chromosome can be determined unambiguously.
Specifically: if the first distribution ratio value is greater than 0.6 and the second distribution ratio value is less than 0.4, the family source of the variation data is determined to be the male parent; if the first distribution ratio value is less than 0.4 and the second distribution ratio value is greater than 0.6, the family source of the variation data is determined to be the female parent. If neither case applies, the family source of the variation data cannot be determined.
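The decision rule of steps S131-S133 can be sketched as a small function. The function name and the returned labels are illustrative assumptions; the default thresholds 0.6 and 0.4 are the preset values given in the description.

```python
def family_source(d1, d2, first_preset=0.6, second_preset=0.4):
    """Classify the source of the duplicated segment.

    d1: first distribution ratio value (father homozygous, mother wild type).
    d2: second distribution ratio value (father wild type, mother homozygous).
    """
    if d1 > first_preset and d2 < second_preset:
        return "paternal"
    if d1 < second_preset and d2 > first_preset:
        return "maternal"
    # Neither pattern is met, so no source is assigned.
    return "undetermined"


print(family_source(0.66, 0.32))
print(family_source(0.31, 0.68))
print(family_source(0.50, 0.50))
```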
Referring to FIG. 2, a flowchart illustrating an exemplary method for detecting a source of copy number repeat variation is shown.
Specifically, variation data such as the position and variation type of the variant fragment to be examined are obtained, together with the raw sequencing data of the standard family of the person to be tested; the raw family sequencing data then undergo preprocessing operations such as data cleaning, quality control, alignment, mutation detection, and data filtering. The point mutation data of the family members are merged and their union is taken, and all family point mutation data lying on the fragment are screened out of the union according to the position of the variant fragment. For sites where the proband is heterozygous, the father is homozygous mutant, and the mother is wild type, an array of the proband's point mutation ratios is obtained, and the distribution ratio value calculated from this array is the first distribution ratio value. Similarly, for sites where the proband is heterozygous, the father is wild type, and the mother is homozygous mutant, an array of the proband's point mutation ratios is obtained, and the distribution ratio value calculated from this array is the second distribution ratio value.
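The union, fragment screening, and genotype-pattern steps described above can be sketched as follows. The data layout is a hypothetical simplification: each family member is a dictionary keyed by position, the proband's entries hold a genotype label and a mutation ratio, and a site missing from a parent's calls is treated as wild type. None of these names or conventions come from the patent.

```python
def ratio_arrays(proband, father, mother, start, end):
    """Return (first_array, second_array) of proband mutation ratios.

    proband: {position: (genotype, mutation_ratio)}
    father, mother: {position: genotype}
    Genotype labels (assumed): "het", "hom", "wt".
    """
    # Union of all point mutation positions across the family.
    positions = set(proband) | set(father) | set(mother)
    first, second = [], []
    for pos in sorted(positions):
        # Keep only sites lying on the variant fragment.
        if not (start <= pos <= end):
            continue
        genotype, ratio = proband.get(pos, ("wt", 0.0))
        if genotype != "het":
            continue
        f = father.get(pos, "wt")  # assumption: uncalled site means wild type
        m = mother.get(pos, "wt")
        if f == "hom" and m == "wt":
            first.append(ratio)
        elif f == "wt" and m == "hom":
            second.append(ratio)
    return first, second


proband = {100: ("het", 0.66), 120: ("hom", 1.0), 150: ("het", 0.31), 900: ("het", 0.5)}
father = {100: "hom", 120: "hom", 900: "hom"}
mother = {120: "het", 150: "hom"}
print(ratio_arrays(proband, father, mother, start=50, end=500))
```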
If the first distribution ratio value is greater than 0.6 and the second distribution ratio value is less than 0.4, the family source of the variation data is determined to be the male parent; if the first distribution ratio value is less than 0.4 and the second distribution ratio value is greater than 0.6, the family source of the variation data is determined to be the female parent. If neither case applies, the family source of the variation data cannot be determined.
Referring to FIG. 3, a schematic diagram of copy number results provided by an embodiment of the present application is shown.
In a test example for determining the source of a copy number duplication, the copy numbers detected for the genes on the fragment are shown in FIG. 3. The vast majority of genes on this fragment have three copies, so the data of this fragment were used as test data.
After all the point mutation data on the fragment of the test example in FIG. 3 were screened, the mutation ratio array obtained for the sites where the proband is heterozygous, the father is homozygous mutant, and the mother is wild type is (0.69,0.75,0.76,0.62,0.63,0.69,0.71,0.67,0.62,0.64,0.5,0.67,0.89,0.64,0.66,0.82,0.73,0.62,0.65,0.71,0.68,0.62,0.71,0.63,0.77,0.99,0.62,0.67,0.66,0.74,0.66,0.67,0.74,0.76,0.6,0.6,0.71,0.65,0.6,0.65,0.69,0.65,0.65,0.67,0.65,0.6,0.82,0.68,0.6,0.72,0.79,0.71,0.69). The mutation ratio array obtained for the sites where the proband is heterozygous, the father is wild type, and the mother is homozygous mutant is (0.33,0.3,0.33,0.36,0.38,0.26,0.31,0.36,0.3,0.31,0.31,0.31,0.28,0.32,0.32,0.23,0.23,0.25,0.32,0.36,0.36,0.35,0.31,0.4,0.33,0.33,0.43,0.28,0.28,0.33,0.33,0.34,0.29,0.28,0.31,0.28,0.4,0.39,0.36,0.26,0.21,0.2,0.37,0.83,0.35,0.39,0.34,0.34,0.22,0.22,0.31,0.28). The curves obtained after converting the two arrays with the kernel density estimation function are shown in FIG. 4.
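As a minimal sketch of how the density peaks might be located for these two arrays, the code below evaluates a Gaussian kernel density estimate on a grid over [0, 1] and returns the grid point with the highest density. The bandwidth of 0.05 and the grid resolution are assumptions chosen for illustration; the patent text does not state the values it used, and the function names are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, samples, h):
    """Kernel density estimate at x with bandwidth h."""
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (len(samples) * h)


def kde_peak(samples, h=0.05, steps=1000):
    """Grid point in [0, 1] where the estimated density is highest."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda x: kde(x, samples, h))


# Arrays copied from the test example above.
first = [0.69, 0.75, 0.76, 0.62, 0.63, 0.69, 0.71, 0.67, 0.62, 0.64, 0.5,
         0.67, 0.89, 0.64, 0.66, 0.82, 0.73, 0.62, 0.65, 0.71, 0.68, 0.62,
         0.71, 0.63, 0.77, 0.99, 0.62, 0.67, 0.66, 0.74, 0.66, 0.67, 0.74,
         0.76, 0.6, 0.6, 0.71, 0.65, 0.6, 0.65, 0.69, 0.65, 0.65, 0.67,
         0.65, 0.6, 0.82, 0.68, 0.6, 0.72, 0.79, 0.71, 0.69]
second = [0.33, 0.3, 0.33, 0.36, 0.38, 0.26, 0.31, 0.36, 0.3, 0.31, 0.31,
          0.31, 0.28, 0.32, 0.32, 0.23, 0.23, 0.25, 0.32, 0.36, 0.36, 0.35,
          0.31, 0.4, 0.33, 0.33, 0.43, 0.28, 0.28, 0.33, 0.33, 0.34, 0.29,
          0.28, 0.31, 0.28, 0.4, 0.39, 0.36, 0.26, 0.21, 0.2, 0.37, 0.83,
          0.35, 0.39, 0.34, 0.34, 0.22, 0.22, 0.31, 0.28]

d1 = kde_peak(first)   # first distribution ratio value
d2 = kde_peak(second)  # second distribution ratio value
print(round(d1, 3), round(d2, 3))
```

Because isolated outliers such as 0.99 or 0.83 contribute little density at the main cluster, the peak is less sensitive to them than a plain mean would be, which is one plausible reason for using the density maximum here.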
As the results in FIG. 4 show, the mutation ratios in the two cases are close to 2/3 and 1/3 respectively, so, following the decision procedure of FIG. 2, it can be concluded that the extra copy of this chromosomal segment was inherited by the proband from the father.
The method for judging the source of a copy number duplication can identify whether the duplicated chromosome fragment causing the copy number duplication comes from the father or the mother, so that genetic information can assist in predicting the origin of the corresponding disease.
It should be noted that when a CNV detection method finds that a certain chromosomal segment in the genome of a proband has three copies, such a duplicated segment may affect some gene functions of the proband and thereby cause an abnormal phenotype. To better understand the chromosomal copy number variation, it is necessary to determine the relationship between the extra chromosomal segment in the proband's genome and the abnormal phenotype. Determining the source of the duplicated chromosomal segment shows which parent the duplicated variant segment comes from, which improves the understanding of the proband's disease and supports in-depth study of its pathogenesis.
Both procedures for determining the source of the fragment copy number variation require at least the raw whole-exome or whole-genome sequencing data of a standard three-person family (trio) that includes the proband; processing these raw data ultimately yields the point mutation data that can be used to determine the source of the variation.
In this embodiment, the source detection method for copy number repeat variation has the following beneficial effects: the application acquires the copy number repeat variation data and determines the related family point mutation data, calculates the point mutation density distribution ratio value according to the mutation types of the family point mutation data and of the repeat variation data, and determines the source of the variation based on that value. The approach is closely grounded in genetic principles, which improves detection precision and accuracy.
The embodiment of the application also provides a source detection device for copy number repeat variation, and referring to fig. 5, a schematic structure diagram of the source detection device for copy number repeat variation according to an embodiment of the application is shown.
Wherein, as an example, the source detection device for copy number repetition variation may include:
an acquisition and determination module 501, configured to determine a previous generation mutation type corresponding to the family point mutation data and determine a current generation mutation type corresponding to the mutation data after acquiring mutation data related to copy number repeated mutation and acquiring family point mutation data corresponding to the mutation data;
the distribution ratio calculating module 502 is configured to calculate a point mutation density distribution ratio of the mutation data to the family point mutation data according to the previous generation mutation type and the current generation mutation type;
a determining family source module 503, configured to determine a family source of the mutation data based on the point mutation density distribution ratio value.
Optionally, the point mutation density distribution ratio value includes: a first distribution ratio value;
the module for calculating the distribution proportion value is further used for:
if the previous generation mutation type is a homozygous mutation of a male parent and a wild mutation of a female parent, and the current generation mutation type is a heterozygous mutation, extracting a first point mutation proportion array from the mutation data;
and converting the first point mutation proportion array into a first density distribution curve by using a preset kernel density estimation function, fitting the first density distribution curve by using a preset Gaussian distribution kernel function, and obtaining a distribution proportion value corresponding to the highest point of the first density distribution curve to obtain a first distribution proportion value.
Optionally, the point mutation density distribution ratio value further includes: a second distribution ratio value;
the module for calculating the distribution proportion value is further used for:
if the previous generation mutation type is a wild mutation of a male parent and a homozygous mutation of a female parent, and the current generation mutation type is a heterozygous mutation, extracting a second point mutation proportion array from the mutation data;
and converting the second point mutation proportion array into a second density distribution curve by using a preset kernel density estimation function, fitting the second density distribution curve by using a preset Gaussian distribution kernel function, and obtaining a distribution proportion value corresponding to the highest point of the second density distribution curve to obtain a second distribution proportion value.
Optionally, the determining family source module is further configured to:
if the first distribution proportion value is larger than a first preset value and the second distribution proportion value is smaller than a second preset value, determining that the family source of the variation data is a male parent;
if the first distribution proportion value is smaller than a second preset value and the second distribution proportion value is larger than the first preset value, determining that the family source of the variation data is a female parent.
Optionally, the determining family source module is further configured to:
if the first distribution ratio value and the second distribution ratio value satisfy neither of the above conditions, determining that the family source of the variation data cannot be determined.
Optionally, the preset kernel density estimation function is as follows:
the preset Gaussian distribution kernel function is as follows:
optionally, the acquiring and determining module is further configured to:
acquiring a plurality of family sequencing original data corresponding to the variation data, and preprocessing each family sequencing original data to obtain family processing data, wherein the preprocessing comprises the following steps: data cleaning, data quality control, data comparison, mutation detection and data filtering, wherein each family sequencing original data is the gene data of the family member corresponding to the mutation data;
combining the point mutation data contained in the family processing data, and extracting a union set of the combined data to obtain a point mutation data set;
and carrying out data screening on the point mutation data set according to the mutation fragments corresponding to the mutation data to obtain family point mutation data.
It will be clearly understood by those skilled in the art that, for convenience and brevity, the specific working process of the apparatus described above may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
Further, an embodiment of the present application further provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, which when executed implements the source detection method for copy number repeat variation as described in the above embodiments.
Further, an embodiment of the present application also provides a computer-readable storage medium storing a computer-executable program for causing a computer to execute the source detection method for copy number duplication variation as described in the above embodiment.
While the foregoing is directed to the preferred embodiments of the present application, it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of the application, such changes and modifications are also intended to be within the scope of the application.
Claims (10)
1. A method for detecting a source of repeat variation in copy number, the method comprising:
after obtaining mutation data about repeated mutation of the copy number and obtaining family point mutation data corresponding to the mutation data, determining a previous generation mutation type corresponding to the family point mutation data and determining a current generation mutation type corresponding to the mutation data;
calculating a point mutation density distribution proportion value of the mutation data to the family point mutation data according to the previous generation mutation type and the current generation mutation type;
determining the family source of the mutation data based on the point mutation density distribution ratio value.
2. The method for detecting a source of repetitive variation according to claim 1, wherein the point mutation density distribution ratio value comprises: a first distribution ratio value;
the calculating the point mutation density distribution proportion value of the mutation data to the family point mutation data according to the previous generation mutation type and the current generation mutation type comprises the following steps:
if the previous generation mutation type is a homozygous mutation of a male parent and a wild mutation of a female parent, and the current generation mutation type is a heterozygous mutation, extracting a first point mutation proportion array from the mutation data;
and converting the first point mutation proportion array into a first density distribution curve by using a preset kernel density estimation function, fitting the first density distribution curve by using a preset Gaussian distribution kernel function, and obtaining a distribution proportion value corresponding to the highest point of the first density distribution curve to obtain a first distribution proportion value.
3. The method for detecting a source of repetitive variation according to claim 2, wherein the point mutation density distribution ratio value further comprises: a second distribution ratio value;
the calculating the point mutation density distribution proportion value of the mutation data to the family point mutation data according to the previous generation mutation type and the current generation mutation type comprises the following steps:
if the previous generation mutation type is a wild mutation of a male parent and a homozygous mutation of a female parent, and the current generation mutation type is a heterozygous mutation, extracting a second point mutation proportion array from the mutation data;
and converting the second point mutation proportion array into a second density distribution curve by using a preset kernel density estimation function, fitting the second density distribution curve by using a preset Gaussian distribution kernel function, and obtaining a distribution proportion value corresponding to the highest point of the second density distribution curve to obtain a second distribution proportion value.
4. The method according to claim 3, wherein determining the source of the mutation data based on the point mutation density distribution ratio value comprises:
if the first distribution proportion value is larger than a first preset value and the second distribution proportion value is smaller than a second preset value, determining that the family source of the variation data is a male parent;
if the first distribution proportion value is smaller than a second preset value and the second distribution proportion value is larger than the first preset value, determining that the family source of the variation data is a female parent.
5. The method according to claim 4, wherein determining the source of the mutation data based on the point mutation density distribution ratio value, further comprises:
if the first distribution ratio value and the second distribution ratio value satisfy neither of the above conditions, determining that the family source of the variation data cannot be determined.
6. A source detection method for copy number repeat variation as claimed in any one of claims 2 or 3, wherein the predetermined kernel density estimation function is as follows:
the preset Gaussian distribution kernel function is as follows:
7. The method for detecting a source of repetitive variation according to any one of claims 1 to 5, wherein the obtaining family point mutation data corresponding to the variation data comprises:
acquiring a plurality of family sequencing original data corresponding to the variation data, and preprocessing each family sequencing original data to obtain family processing data, wherein the preprocessing comprises the following steps: data cleaning, data quality control, data comparison, mutation detection and data filtering, wherein each family sequencing original data is the gene data of the family member corresponding to the mutation data;
combining the point mutation data contained in the family processing data, and extracting a union set of the combined data to obtain a point mutation data set;
and carrying out data screening on the point mutation data set according to the mutation fragments corresponding to the mutation data to obtain family point mutation data.
8. A source detection device for copy number repeat variation, the device comprising:
the acquisition and determination module is used for determining the previous generation mutation type corresponding to the family point mutation data and determining the current generation mutation type corresponding to the mutation data after acquiring the mutation data related to the repeated mutation of the copy number and acquiring the family point mutation data corresponding to the mutation data;
the distribution proportion value calculating module is used for calculating the point mutation density distribution proportion value of the mutation data accounting for the family point mutation data according to the previous generation mutation type and the current generation mutation type;
and a determining family source module for determining family sources of the mutation data based on the point mutation density distribution ratio value.
9. An electronic device, comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the source detection method for copy number repeat variation as claimed in any one of claims 1 to 7 when the program is executed.
10. A computer-readable storage medium storing a computer-executable program for causing a computer to execute the source detection method for copy number duplication variation according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310851930.7A CN116935961B (en) | 2023-07-12 | 2023-07-12 | Source detection method and device for copy number repeat variation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116935961A (en) | 2023-10-24 |
CN116935961B (en) | 2024-06-25 |
Family
ID=88382035
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310851930.7A Active CN116935961B (en) | 2023-07-12 | 2023-07-12 | Source detection method and device for copy number repeat variation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116935961B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109785899A (en) * | 2019-02-18 | 2019-05-21 | 东莞博奥木华基因科技有限公司 | A kind of device and method of genotype correction |
CN111341383A (en) * | 2020-03-17 | 2020-06-26 | 安吉康尔(深圳)科技有限公司 | Method, device and storage medium for detecting copy number variation |
US20210366570A1 (en) * | 2017-07-25 | 2021-11-25 | Sophia Genetics Sa | Methods for detecting biallelic loss of function in next-generation sequencing genomic data |
CN115433777A (en) * | 2022-10-26 | 2022-12-06 | 北京中仪康卫医疗器械有限公司 | Integrated identification method for CNV, SV and SGD abnormalities and abnormal sources of embryos |
Non-Patent Citations (5)
Title |
---|
GUOJUN LIU ET AL.: "RKDOSCNV: A Local Kernel Density-Based Approach to the Detection of Copy Number Variations by Using Next-Generation Sequencing Data", COMPUTATIONAL GENOMICS, vol. 11, 4 November 2020 (2020-11-04), pages 1 - 13 * |
YANG, LIU ET AL.: "Towards the detection of copy number variation from single sperm sequencing in cattle", BMC GENOMICS, vol. 23, no. 01, 30 March 2022 (2022-03-30), pages 1 - 9 * |
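Liu Yang: "Research on rare variants based on genome-wide association analysis", China Master's Theses Full-text Database (Basic Sciences), no. 01, 15 January 2019 (2019-01-15), pages 006 - 709 * |
Wang Hualiang et al.: "Medical Laboratory Construction and Quality Management", 30 November 2021, Shanghai: Shanghai Scientific & Technical Publishers, pages: 314 - 320 * |
Wang Jing et al.: "Study on the correlation of genetic factors between xanthelasma palpebrarum and hypercholesterolemia", Guoji Yanke Zazhi (International Eye Science), vol. 23, no. 04, 6 April 2023 (2023-04-06), pages 689 - 693 * |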
Also Published As
Publication number | Publication date |
---|---|
CN116935961B (en) | 2024-06-25 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||