CN116935961B - Source detection method and device for copy number repeat variation - Google Patents

Info

Publication number
CN116935961B
Authority
CN
China
Prior art keywords
mutation
data
family
point
point mutation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310851930.7A
Other languages
Chinese (zh)
Other versions
CN116935961A (en)
Inventor
何杰
窦浩宇
刘永初
燕攀
刘阳
李阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Anji Kanger Medical Laboratory
Original Assignee
Shenzhen Anji Kanger Medical Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Anji Kanger Medical Laboratory
Priority to CN202310851930.7A
Publication of CN116935961A
Application granted
Publication of CN116935961B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50 Mutagenesis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/10 Pre-processing; Data cleansing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biotechnology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Genetics & Genomics (AREA)
  • Analytical Chemistry (AREA)
  • Bioethics (AREA)
  • Epidemiology (AREA)
  • Chemical & Material Sciences (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a source detection method and device for copy number repeat variation. The method comprises the following steps: after obtaining mutation data about a copy number repeat variation and obtaining family point mutation data corresponding to the mutation data, determining a previous generation mutation type corresponding to the family point mutation data and a current generation mutation type corresponding to the mutation data; calculating, according to the previous generation mutation type and the current generation mutation type, a point mutation density distribution proportion value of the mutation data relative to the family point mutation data; and determining the family source of the mutation data based on the point mutation density distribution proportion value. The invention acquires copy number repeat variation data, determines the related family point mutation data, calculates the density distribution proportion value of the point mutations according to the mutation type of the family point mutation data and the mutation type of the repeat variation data, and determines the source of the variation from that value. The determination thus conforms to genetic principles, which improves detection precision and accuracy.

Description

Source detection method and device for copy number repeat variation
Technical Field
The invention relates to the technical field of chromosome detection, in particular to a source detection method and device for copy number repeated variation.
Background
With the development of technology, genetic testing has become routine and its range of applications keeps widening. One common technique for detecting abnormal genes is source detection for copy number repeat variation: the gene or whole genome is scanned to find the DNA sequence of the repeat variation, a biological phenotype is determined from that DNA sequence, and the phenotype is matched against the phenotype of the male parent or the female parent to determine the origin of the copy number repeat variation in the chromosome. However, a phenotype may or may not be related to the copy number repeat variation, and phenotypic differences can also be caused by environmental factors. Matching only on the phenotype of the male parent or the female parent can therefore give a result that differs greatly from the actual origin, so detection accuracy is low.
Disclosure of Invention
The invention provides a source detection method and device for copy number repeat variation. The method acquires copy number repeat variation data, determines the related family point mutation data, and determines the source of the variation from the density distribution ratio of the family point mutation data and the repeat variation data, so that the determination conforms to genetic principles and detection precision and accuracy are improved.
A first aspect of an embodiment of the present invention provides a source detection method for copy number repeat variation, the method comprising:
after obtaining mutation data about a copy number repeat variation and obtaining family point mutation data corresponding to the mutation data, determining a previous generation mutation type corresponding to the family point mutation data and a current generation mutation type corresponding to the mutation data;
calculating a point mutation density distribution proportion value of the mutation data relative to the family point mutation data according to the previous generation mutation type and the current generation mutation type;
determining the family source of the mutation data based on the point mutation density distribution proportion value.
In a possible implementation manner of the first aspect, the point mutation density distribution proportion value includes: a first distribution ratio value;
The calculating the point mutation density distribution proportion value of the mutation data to the family point mutation data according to the previous generation mutation type and the current generation mutation type comprises the following steps:
if the previous generation mutation type is a homozygous mutation of the male parent and a wild type of the female parent, and the current generation mutation type is a heterozygous mutation, extracting a first point mutation proportion array from the mutation data;
and converting the first point mutation proportion array into a first density distribution curve by using a preset kernel density estimation function, fitting the first density distribution curve by using a preset Gaussian distribution kernel function, and taking the distribution proportion value corresponding to the highest point of the first density distribution curve as the first distribution proportion value.
In a possible implementation manner of the first aspect, the point mutation density distribution ratio value further includes: a second distribution ratio value;
The calculating the point mutation density distribution proportion value of the mutation data to the family point mutation data according to the previous generation mutation type and the current generation mutation type comprises the following steps:
if the previous generation mutation type is a wild type of the male parent and a homozygous mutation of the female parent, and the current generation mutation type is a heterozygous mutation, extracting a second point mutation proportion array from the mutation data;
and converting the second point mutation proportion array into a second density distribution curve by using a preset kernel density estimation function, fitting the second density distribution curve by using a preset Gaussian distribution kernel function, and taking the distribution proportion value corresponding to the highest point of the second density distribution curve as the second distribution proportion value.
In a possible implementation manner of the first aspect, the determining the source of the mutation data based on the point mutation density distribution ratio value includes:
if the first distribution proportion value is larger than a first preset value and the second distribution proportion value is smaller than a second preset value, determining that the family source of the variation data is a male parent;
If the first distribution proportion value is smaller than a second preset value and the second distribution proportion value is larger than the first preset value, determining that the family source of the variation data is a female parent.
In a possible implementation manner of the first aspect, the determining the source of the mutation data based on the point mutation density distribution ratio value further includes:
If neither the first distribution ratio value nor the second distribution ratio value satisfies a preset value, determining the family source of the variation data.
In a possible implementation manner of the first aspect, the preset kernel density estimation function is as follows:
the preset Gaussian distribution kernel function is as follows:
In a possible implementation manner of the first aspect, the obtaining family point mutation data corresponding to the mutation data includes:
acquiring a plurality of family sequencing original data corresponding to the mutation data, and preprocessing each family sequencing original data to obtain family processing data, wherein the preprocessing comprises data cleaning, data quality control, data alignment, mutation detection and data filtering, and each family sequencing original data is the gene data of a family member corresponding to the mutation data;
combining the point mutation data contained in the family processing data, and extracting a union set of the combined data to obtain a point mutation data set;
and carrying out data screening on the point mutation data set according to the mutation fragments corresponding to the mutation data to obtain family point mutation data.
A second aspect of embodiments of the present invention provides a source detection device for copy number repeat variation, the device comprising:
an acquisition and determination module, configured to, after obtaining mutation data about a copy number repeat variation and obtaining family point mutation data corresponding to the mutation data, determine a previous generation mutation type corresponding to the family point mutation data and a current generation mutation type corresponding to the mutation data;
a distribution proportion value calculation module, configured to calculate a point mutation density distribution proportion value of the mutation data relative to the family point mutation data according to the previous generation mutation type and the current generation mutation type;
and a family source determination module, configured to determine the family source of the mutation data based on the point mutation density distribution proportion value.
Compared with the prior art, the source detection method and device for copy number repeat variation provided by the embodiments of the invention have the following beneficial effects: the invention acquires copy number repeat variation data, determines the related family point mutation data, calculates the density distribution proportion value of the point mutations according to the mutation type of the family point mutation data and the mutation type of the repeat variation data, and determines the source of the variation from that value. The determination thus conforms to genetic principles, which improves detection precision and accuracy.
Drawings
FIG. 1 is a flow chart of a source detection method for copy number repeat variation according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for detecting a source of copy number repeat variation according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of copy number results provided by an embodiment of the present invention;
FIG. 4 is a graph showing the density distribution of the proband's point mutation proportions for the two combinations in the three-copy case according to one embodiment of the present invention;
FIG. 5 is a schematic diagram of a source detection device for copy number repeat variation according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In order to solve the above problems, a source detection method for copy number repeat variation provided in the present embodiments will be described and illustrated in detail by the following specific examples.
Referring to fig. 1, a flow chart of a method for detecting a source of copy number duplication variation according to an embodiment of the invention is shown.
In one embodiment, the method is applicable to a computer system, and the abnormal gene data can be input into the computer system, and the computer system is used for detecting and analyzing the abnormal gene data or gene fragments to determine the specific source of the abnormal gene data.
The method for detecting the source of the copy number repetitive variation may include:
S11, after obtaining mutation data about repeated mutation of the copy number and obtaining family point mutation data corresponding to the mutation data, determining a previous generation mutation type corresponding to the family point mutation data and determining a current generation mutation type corresponding to the mutation data.
Repeat variation is a common type of copy number variation, and its most common form in the human genome is three copies of a chromosome segment. By detecting the source of the repeat variation data, subsequent genetic testing can be carried out effectively according to that source.
In one embodiment, the mutation data about a chromosome copy number repeat variation may include the variation fragment to be detected (specifically, it may be a mutated gene fragment) and the mutation type (for example heterozygous or homozygous). The family point mutation data may be the data of the variation points carried by each family member corresponding to the mutation data, where the family members may be direct relatives.
In an embodiment, determining the previous generation mutation type corresponding to the family point mutation data may mean determining the genotype of the family members in the family point mutation data at the sites where the copy number repeat variation occurs;
determining the current generation mutation type corresponding to the mutation data may mean determining the genotype of the mutation data in which the copy number repeat variation occurs.
The above types may be homozygous mutation or wild type.
Because the mutation data may correspond to many family members, collecting the gene data of all family members for detection would mean a large amount of data to process and a long running time.
As an example, step S11 may include the following sub-steps:
S111, acquiring a plurality of family sequencing original data corresponding to the mutation data, and preprocessing each family sequencing original data to obtain family processing data, wherein the preprocessing comprises data cleaning, data quality control, data alignment, mutation detection and data filtering, and each family sequencing original data is the gene data of a family member corresponding to the mutation data.
In an alternative embodiment, family sequencing raw data for a number of family members may be obtained, which may be genetic data for the family members.
In genetic inheritance, genes from non-adjacent generations have less influence. To further reduce the amount of data processed, in a preferred embodiment the family sequencing original data of the parents corresponding to the mutation data can be obtained (namely the family sequencing original data of the male parent and that of the female parent). The source of the variation is then judged from the variation fragment and from the similarities and differences among all point mutation information of the family trio (the male parent, the female parent and the individual carrying the variation). This retains the inheritance characteristics of chromosomes and the wide distribution of point mutations, and relies on the fixed characteristics shown by different copy number variation types to complete the judgment, which improves detection accuracy.
In an embodiment, after the family sequencing original data of the male parent and of the female parent are obtained, data cleaning, data quality control, sequencing data alignment, mutation detection, mutation information filtering, mutation data annotation and similar steps can be performed in sequence on each of the two sets of family sequencing original data to form the family processing data.
Alternatively, the processed family sequencing raw data may be converted into a family point variation information summary table containing variation information of all members of the family for subsequent processing.
In one embodiment, all point mutation information of the family members can be recorded in the family point mutation information summary table. For each point mutation, the table includes the mutation itself (the bases before and after mutation), its quality value, its zygosity, its count information (the number of sequencing fragments carrying the mutation at the site and the number of sequencing fragments covering the site), and its proportion (the ratio of the number of sequencing fragments carrying the mutation at the site to the number of sequencing fragments covering the site). This point mutation information is required later to judge the source of the copy number variation.
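One row of such a summary table could be modelled as in the following minimal sketch. The class name `PointMutation`, its field names and the example values are illustrative assumptions and do not come from the embodiment; only the list of recorded items and the definition of the proportion follow the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PointMutation:
    """One row of a family point mutation summary table (hypothetical layout)."""
    chrom: str         # chromosome name
    pos: int           # position of the site
    ref: str           # base before mutation
    alt: str           # base after mutation
    quality: float     # quality value of the point mutation
    zygosity: str      # "hom", "het" or "wild"
    alt_reads: int     # sequencing fragments carrying the mutation at the site
    total_reads: int   # sequencing fragments covering the site

    @property
    def ratio(self) -> float:
        """Proportion of covering fragments that carry the mutation."""
        if self.total_reads == 0:
            raise ValueError("site has no coverage")
        return self.alt_reads / self.total_reads


# Made-up example site: 20 of 30 covering fragments carry the mutation.
site = PointMutation("chr15", 23_000_100, "A", "G", 60.0, "het", 20, 30)
print(round(site.ratio, 3))
```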
And S112, merging the point mutation data contained in the family processing data, and extracting a union set of the merged data to obtain a point mutation data set.
In one embodiment, after preprocessing is completed, the point mutation data can be extracted from the family processing data of the male parent and from the family processing data of the female parent, and the two sets of point mutation data are then merged into one data set. The point mutation data shared by the two sets of family processing data is extracted from this data set, that is, the union of the two is taken, to form the point mutation data set.
And S113, carrying out data screening on the point mutation data set according to the mutation fragments corresponding to the mutation data to obtain the family point mutation data.
To further reduce the amount of data processed, the variation fragment contained in the mutation data can be obtained, the point mutation data set can be screened for data lying on that same fragment, and the screened data can be assembled into a data set that forms the family point mutation data.
In yet another alternative embodiment, the point mutation data set includes point mutation information owned by both the male parent and the female parent, and after all the point mutation information is obtained, the point mutations whose quality values do not satisfy the minimum threshold value may be filtered, so as to reduce the number of unnecessary or useless point mutation information.
Specifically, the user can preset the segment to be judged, then analyze the position of the preset segment, and screen out all the point mutation information of the family members in the segment according to the position. And after screening to obtain the point mutation information of all members of the family on the current fragment, starting to judge the mutation sources of different types.
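The quality filtering and the position-based screening described above can be sketched as follows. The `Call` record, the function name `screen_segment`, the member labels and the threshold of 30.0 are all hypothetical choices made for illustration; the embodiment does not specify them.

```python
from typing import Dict, List, NamedTuple


class Call(NamedTuple):
    """Minimal point mutation call (hypothetical layout)."""
    chrom: str
    pos: int
    quality: float


def screen_segment(
    calls_by_member: Dict[str, List[Call]],
    chrom: str,
    start: int,
    end: int,
    min_quality: float,
) -> Dict[str, List[Call]]:
    """Keep, for each family member, the calls that lie inside the preset
    segment and whose quality value meets the minimum threshold."""
    screened = {}
    for member, calls in calls_by_member.items():
        screened[member] = [
            c for c in calls
            if c.chrom == chrom and start <= c.pos <= end and c.quality >= min_quality
        ]
    return screened


# Made-up calls for a trio; the segment and threshold are assumptions.
family = {
    "male_parent": [Call("chr15", 100, 50.0), Call("chr15", 900, 55.0)],
    "female_parent": [Call("chr15", 150, 10.0), Call("chr2", 120, 60.0)],
    "proband": [Call("chr15", 100, 45.0), Call("chr15", 150, 40.0)],
}
kept = screen_segment(family, "chr15", 50, 500, min_quality=30.0)
```

Filtering on quality and on position in a single pass keeps the sketch short; a real pipeline might apply the two filters at different stages.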
In yet another alternative embodiment, when screening point mutations, point mutations for which all family members have the same mutation type (that is, the point mutation is identical and all members are homozygous, all wild type or all heterozygous) may be deleted so that the result is more clear-cut.
In addition, in some cases, it is also possible to find new chromosomal copy number variation segments by comprehensively analyzing all the point mutation information contained on a specific chromosomal segment of all members of the family.
S12, calculating a point mutation density distribution proportion value of the mutation data to the family point mutation data according to the previous generation mutation type and the current generation mutation type.
Because the family point mutation data comprises the point mutation data of the male parent and the point mutation data of the female parent, the previous generation mutation type corresponding to the family point mutation data can comprise the mutation type of the male parent or the mutation type of the female parent.
After the previous generation mutation type and the current generation mutation type are determined, the two can be combined to calculate the point mutation density distribution proportion value of the point mutations contained in the mutation data relative to those contained in the family point mutation data, so that the source of the variation can be determined from this distribution proportion value.
In an alternative embodiment, the point mutation density distribution ratio value includes: a first distribution ratio value;
as an example, step S12 may include the following sub-steps:
S121, if the previous generation mutation type is a homozygous mutation of the male parent and a wild type of the female parent, and the current generation mutation type is a heterozygous mutation, extracting a first point mutation proportion array from the mutation data.
S122, converting the first point mutation proportion array into a first density distribution curve by using a preset kernel density estimation function, fitting the first density distribution curve by using a preset Gaussian distribution kernel function, and obtaining a distribution proportion value corresponding to the highest point of the first density distribution curve to obtain a first distribution proportion value.
Most three-copy repeats of a chromosome fragment arise because one of the parents provides an extra chromosome copy, so the core of determining the source is to determine where the extra chromosome copy comes from in the individual whose mutation data is being tested (also called the proband).
Because the proband carries three copies of the chromosome fragment, each parent has passed at least one copy to the proband. The judgment therefore cannot start directly from the parents. It must start from the proband: the point mutation combination across the proband's three copies is examined and then related to the parents' mutation combinations at the same sites to judge the source of the extra copy.
Starting from the proband, when the sites screened are those at which the proband is a homozygous mutation or wild type, it is impossible to determine which parent the extra copy comes from, whatever the point mutation types of the two parents.
For example, when the proband is homozygous, each parent must be homozygous or heterozygous. When both parents are homozygous, no judgment is possible; when both parents are heterozygous, no judgment is possible either; when one parent is heterozygous and the other is homozygous, the homozygous parent may have passed two copies to the proband, but the heterozygous parent may have undergone recombination at meiosis and likewise passed two mutant copies to the proband. Therefore, when the proband is a homozygous mutation or wild type, the source of the variation cannot be determined.
When the sites screened are those at which the proband is a heterozygous mutation, every such site must contain either two mutant bases and one wild-type base or one mutant base and two wild-type bases. The proportion of mutant bases among all bases at these heterozygous sites should therefore be 1/3 or 2/3, so the ratio of the number of sequencing fragments carrying the mutation to the number of sequencing fragments covering the site should be close to 1/3 or 2/3, and this ratio can be used to determine the origin of the extra copy. Among such heterozygous sites, if either parent is heterozygous, it is impossible to determine which parent passed the point mutation to the proband; if the sites are restricted to those where one parent is a homozygous mutation and the other is wild type, the source can be judged from the proband's heterozygous ratio in the two cases.
Specifically, when the sites are screened so that the male parent is a homozygous mutation and the female parent is wild type, all heterozygous point mutations can be extracted from the proband's mutation data, and the mutation proportions of these point mutations are collected into an array to obtain the first point mutation proportion array.
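The site screening described above can be sketched as a small function that builds both proportion arrays in one pass. The dictionary keys (`father`, `mother`, `proband`, `ratio`), the zygosity labels and the example values are assumptions made for this illustration only.

```python
def ratio_arrays(sites):
    """Split the proband's heterozygous-site ratios by parental combination.

    Each site is a dict with hypothetical keys 'father', 'mother' and
    'proband' (zygosity: 'hom', 'het' or 'wild') and 'ratio' (the proband's
    mutation proportion at that site).
    """
    first, second = [], []
    for s in sites:
        if s["proband"] != "het":
            continue  # homozygous or wild-type proband sites are uninformative
        if s["father"] == "hom" and s["mother"] == "wild":
            first.append(s["ratio"])   # first point mutation proportion array
        elif s["father"] == "wild" and s["mother"] == "hom":
            second.append(s["ratio"])  # second point mutation proportion array
        # sites where either parent is heterozygous are skipped
    return first, second


# Made-up example sites.
sites = [
    {"father": "hom", "mother": "wild", "proband": "het", "ratio": 0.66},
    {"father": "wild", "mother": "hom", "proband": "het", "ratio": 0.34},
    {"father": "het", "mother": "hom", "proband": "het", "ratio": 0.50},
    {"father": "hom", "mother": "hom", "proband": "hom", "ratio": 1.00},
]
first, second = ratio_arrays(sites)
```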
The density distribution of the mutation proportions is then computed with the preset kernel density estimation function (Kernel Density Estimation), giving the mutation proportion at the highest point of the density distribution. The principle of kernel density estimation is to convert a histogram-like frequency distribution into a smooth density distribution through a kernel function, which yields the first density distribution curve.
The Gaussian kernel, the most commonly used kernel for density estimation, can then be used. The mutation proportion at the highest point of the first density distribution curve fitted by the kernel density estimation function is the range in which the mutation proportions are mainly distributed under the current screening condition, namely the first distribution proportion value. The source of the repeated chromosome can then be judged from this value according to the judgment logic.
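A minimal sketch of this step, using only the Python standard library, is shown below. It evaluates a Gaussian kernel density estimate on an evenly spaced grid over [0, 1] and returns the grid point with the highest density. The bandwidth of 0.05, the grid of 201 points, the function names and the example ratios are assumptions chosen for illustration; the embodiment does not state these values.

```python
import math


def gaussian_kernel(u: float) -> float:
    """Standard Gaussian kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x: float, samples, h: float) -> float:
    """Kernel density estimate at x with bandwidth h."""
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (len(samples) * h)


def density_peak(samples, h: float = 0.05, grid_points: int = 201) -> float:
    """Return the ratio in [0, 1] at which the estimated density is highest."""
    if not samples:
        raise ValueError("no ratios to estimate from")
    grid = [i / (grid_points - 1) for i in range(grid_points)]
    return max(grid, key=lambda x: kde(x, samples, h))


# Made-up proportion array: most ratios cluster near 2/3, with one outlier.
ratios = [0.62, 0.65, 0.66, 0.67, 0.68, 0.70, 0.71, 0.35]
peak = density_peak(ratios)
print(peak)
```

Taking the peak of the density instead of the mean keeps the single outlier at 0.35 from pulling the estimate away from the main cluster.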
In an alternative embodiment, the point mutation density distribution ratio value includes: a second distribution ratio value;
as an example, step S12 may include the following sub-steps:
S123, if the previous generation mutation type is a wild type of the male parent and a homozygous mutation of the female parent, and the current generation mutation type is a heterozygous mutation, extracting a second point mutation proportion array from the mutation data.
S124, converting the second point mutation proportion array into a second density distribution curve by using a preset kernel density estimation function, fitting the second density distribution curve by using a preset Gaussian distribution kernel function, and obtaining a distribution proportion value corresponding to the highest point of the second density distribution curve to obtain a second distribution proportion value.
In this embodiment, steps S123-S124 follow the same procedure as steps S121-S122, and the analysis described above applies.
The operation is as follows: when the sites are screened so that the male parent is wild type and the female parent is a homozygous mutation, all heterozygous point mutations can be extracted from the proband's mutation data, and the mutation proportions of these point mutations are collected into an array to obtain the second point mutation proportion array.
The density distribution of the mutation proportions is then computed with the preset kernel density estimation function (Kernel Density Estimation), giving the mutation proportion at the highest point of the density distribution. The principle of kernel density estimation is to convert a histogram-like frequency distribution into a smooth density distribution through a kernel function, which yields the second density distribution curve.
The Gaussian kernel, the most commonly used kernel for density estimation, can then be used. The mutation proportion at the highest point of the second density distribution curve fitted by the kernel density estimation function is the range in which the mutation proportions are mainly distributed under the current screening condition, namely the second distribution proportion value. The source of the repeated chromosome can then be judged from this value according to the judgment logic.
In one embodiment, the preset kernel density estimation functions used in steps S121-S122 and steps S123-S124 are as follows:
The kernel function K used in the above formula, that is, the preset kernel function used in steps S121-S122 and steps S123-S124, is a Gaussian distribution kernel function, as described in the following formula:
In the above formula, the value p returned by the function is the probability density at a given value x, and h is the chosen window width (bandwidth); the larger the bandwidth, the smoother the curve. n is the number of points that fall within the bandwidth, and by the definition of the kernel density estimation function every Xi within the bandwidth range of x participates in the calculation of the density value at that point.
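The formulas themselves are not reproduced in this text. For orientation only, the textbook forms of a kernel density estimate and a Gaussian kernel that match the symbols described above (p, x, h, n, Xi, K) are given below. This is an assumption based on the standard definitions, not a transcription of the patent's formulas; the intermediate variable u is introduced here for illustration.

```latex
p(x) = \frac{1}{n h} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right),
\qquad
K(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{u^{2}}{2}\right)
```

In this form, a larger h spreads each point's contribution over a wider range of x, which is why the curve becomes smoother as the bandwidth grows.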
S13, determining the family source of the mutation data based on the point mutation density distribution proportion value.
In an embodiment, after the first distribution ratio value and the second distribution ratio value are calculated, it may be determined whether the family source of the variation data is the father or the mother according to the values of the first distribution ratio value and the second distribution ratio value.
In one embodiment, step S13 may include the sub-steps of:
S131, if the first distribution ratio value is greater than a first preset value and the second distribution ratio value is less than a second preset value, determining that the family source of the variation data is the male parent.
S132, if the first distribution ratio value is less than the second preset value and the second distribution ratio value is greater than the first preset value, determining that the family source of the variation data is the female parent.
S133, if the first and second distribution ratio values satisfy neither of the above conditions, determining that the family source of the variation data cannot be determined.
Specifically, once the first and second distribution ratio values have been calculated, they can be interpreted as follows: because the parent that is the source of the duplicated chromosome contributes one extra chromosome segment, the proband's mutation ratio at heterozygous point mutations shifts markedly.
When screening sites at which the proband is heterozygous: if one parent contributes the extra chromosome copy, then at sites where that parent is homozygous mutant and the other parent is wild type, the proband's mutation ratio should be approximately 2/3; at sites where that parent is wild type and the other parent is homozygous mutant, the proband's mutation ratio should be approximately 1/3.
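The 2/3 and 1/3 figures follow from counting allele copies on a three-copy segment. The short sketch below spells out that arithmetic; the function name and its parameter are hypothetical, and it assumes the source parent contributes exactly two of the three copies.

```python
from fractions import Fraction


def expected_mutant_fraction(source_is_mutant_parent: bool) -> Fraction:
    """Expected mutant-allele fraction in the proband at a three-copy site.

    Assumes one parent is homozygous mutant, the other is wild type, and the
    parent that is the source of the duplication contributes two copies.
    """
    total_copies = 3
    # Two mutant copies if the duplication comes from the homozygous-mutant
    # parent, otherwise only the single copy from the other parent is mutant.
    mutant_copies = 2 if source_is_mutant_parent else 1
    return Fraction(mutant_copies, total_copies)


print(expected_mutant_fraction(True))   # source parent is homozygous mutant
print(expected_mutant_fraction(False))  # source parent is wild type
```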
In practice, because of various influencing factors, the density peaks fall near 1/3 and 2/3 rather than exactly on them. The density peaks d1 and d2 obtained for the two combinations are therefore compared with a first preset value (about 0.6, corresponding to 2/3) and a second preset value (about 0.4, corresponding to 1/3). If one of the two distribution ratio values is greater than the first preset value (0.6) and the other is less than the second preset value (0.4), the genetic source of the extra chromosome can be clearly determined.
Specifically: if the first distribution ratio value is greater than 0.6 and the second distribution ratio value is less than 0.4, the family source of the variation data is determined to be the male parent; if the first distribution ratio value is less than 0.4 and the second distribution ratio value is greater than 0.6, the family source of the variation data is determined to be the female parent. In any other case, the family source of the variation data cannot be determined.
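The decision rule can be sketched as a small function. The thresholds 0.6 and 0.4 are the preset values given in the text; the function name, the constant names, and the use of `None` for the undetermined case are illustrative assumptions.

```python
from typing import Optional

HIGH_THRESHOLD = 0.6  # first preset value (near 2/3)
LOW_THRESHOLD = 0.4   # second preset value (near 1/3)


def infer_duplication_source(d1: float, d2: float) -> Optional[str]:
    """Return 'father', 'mother', or None when the source is undetermined.

    d1: density peak for sites where the father is homozygous mutant and the
        mother is wild type (first distribution ratio value).
    d2: density peak for sites where the father is wild type and the mother
        is homozygous mutant (second distribution ratio value).
    """
    if d1 > HIGH_THRESHOLD and d2 < LOW_THRESHOLD:
        return "father"
    if d1 < LOW_THRESHOLD and d2 > HIGH_THRESHOLD:
        return "mother"
    return None


print(infer_duplication_source(0.67, 0.32))
print(infer_duplication_source(0.33, 0.66))
print(infer_duplication_source(0.50, 0.50))
```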
Referring to FIG. 2, a flowchart illustrating an exemplary method for detecting a source of copy number repeat variation is shown.
Specifically, variation data such as the position and the variation type of the variant fragment to be judged are obtained, together with the raw sequencing data of the standard family of the person under test. The raw family sequencing data undergo preprocessing operations such as data cleaning, quality control, alignment, mutation detection, and data filtering. The point mutation data of the family members are then merged and their union is taken, and all family point mutation data located on the fragment are screened out of the union according to the position of the variant fragment. For sites where the proband is heterozygous, the father is homozygous mutant, and the mother is wild type, the array of the proband's point mutation ratios is obtained and used to calculate the first distribution ratio value. Similarly, for sites where the proband is heterozygous, the father is wild type, and the mother is homozygous mutant, the array of the proband's point mutation ratios is obtained and used to calculate the second distribution ratio value.
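The union, segment filter, and genotype screening can be sketched as follows. The data model is hypothetical: each family member's calls are a dictionary keyed by (chromosome, position), and a site missing from a member's calls is treated as wild type, which is a simplifying assumption (a real pipeline would also need to rule out missing coverage).

```python
from typing import Dict, List, Tuple

Site = Tuple[str, int]                 # (chromosome, position)
Calls = Dict[Site, Tuple[str, float]]  # site -> (genotype, mutation ratio)


def genotype(calls: Calls, site: Site) -> str:
    """Return 'het' or 'hom' if the site was called, else 'wild' (assumption)."""
    return calls[site][0] if site in calls else "wild"


def ratio_arrays(proband: Calls, father: Calls, mother: Calls,
                 chrom: str, start: int, end: int) -> Tuple[List[float], List[float]]:
    # Union of all point mutation sites seen in any family member.
    union = set(proband) | set(father) | set(mother)
    # Keep only the sites located on the variant fragment.
    on_segment = sorted(s for s in union if s[0] == chrom and start <= s[1] <= end)
    first: List[float] = []
    second: List[float] = []
    for site in on_segment:
        if genotype(proband, site) != "het":
            continue
        f, m = genotype(father, site), genotype(mother, site)
        if f == "hom" and m == "wild":
            first.append(proband[site][1])    # first point mutation ratio array
        elif f == "wild" and m == "hom":
            second.append(proband[site][1])   # second point mutation ratio array
    return first, second


# Toy data for illustration only.
proband = {("chr1", 100): ("het", 0.68), ("chr1", 200): ("het", 0.31),
           ("chr1", 900): ("het", 0.50)}
father = {("chr1", 100): ("hom", 1.0)}
mother = {("chr1", 200): ("hom", 1.0)}
first, second = ratio_arrays(proband, father, mother, "chr1", 50, 500)
print(first, second)
```

The site at position 900 lies outside the example segment and is therefore excluded from both arrays.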
If the first distribution proportion value is larger than 0.6 and the second distribution proportion value is smaller than 0.4, determining the family source of the variation data as a male parent; if the first distribution ratio value is smaller than 0.4 and the second distribution ratio value is larger than 0.6, determining the family source of the variation data as the female parent. If the two cases are not the case, the family source of the mutation data cannot be determined.
Referring to FIG. 3, a schematic diagram of copy number results provided by an embodiment of the present invention is shown.
In a test example for determining the source of a copy number duplication, the copy numbers detected for the genes on the fragment are shown in FIG. 3. The vast majority of genes on this fragment have three copies, so the data for this fragment were used as test data.
After screening all point mutation data on the segment of the test example in FIG. 3, two mutation ratio arrays were obtained. For the point mutations where the proband is heterozygous, the father is homozygous mutant, and the mother is wild type, the array is (0.69,0.75,0.76,0.62,0.63,0.69,0.71,0.67,0.62,0.64,0.5,0.67,0.89,0.64,0.66,0.82,0.73,0.62,0.65,0.71,0.68,0.62,0.71,0.63,0.77,0.99,0.62,0.67,0.66,0.74,0.66,0.67,0.74,0.76,0.6,0.6,0.71,0.65,0.6,0.65,0.69,0.65,0.65,0.67,0.65,0.6,0.82,0.68,0.6,0.72,0.79,0.71,0.69). For the point mutations where the proband is heterozygous, the father is wild type, and the mother is homozygous mutant, the array is (0.33,0.3,0.33,0.36,0.38,0.26,0.31,0.36,0.3,0.31,0.31,0.31,0.28,0.32,0.32,0.23,0.23,0.25,0.32,0.36,0.36,0.35,0.31,0.4,0.33,0.33,0.43,0.28,0.28,0.33,0.33,0.34,0.29,0.28,0.31,0.28,0.4,0.39,0.36,0.26,0.21,0.2,0.37,0.83,0.35,0.39,0.34,0.34,0.22,0.22,0.31,0.28). Both arrays were converted by the kernel density estimation function, and the resulting curves are shown in FIG. 4.
As the results in FIG. 4 show, the mutation ratios in the two cases are close to 2/3 and 1/3, respectively, so according to the decision logic of FIG. 2 it can be judged that the extra copy of this chromosomal segment was inherited by the proband from the father.
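To make the test example concrete, the sketch below applies a plain Gaussian kernel density estimate to the two arrays listed above and locates each density peak by grid search. The bandwidth of 0.05, the grid resolution, and all function names are assumptions chosen for illustration; the patent does not state the bandwidth it used, so the exact peak positions may differ from those in FIG. 4.

```python
import math


def gaussian_kde(samples, x, bandwidth):
    """Density at x: (1 / (n * h)) * sum of Gaussian kernels centred on each sample."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in samples)


def density_peak(samples, bandwidth=0.05, steps=1000):
    """Ratio in [0, 1] at which the estimated density is highest (grid search)."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda x: gaussian_kde(samples, x, bandwidth))


# Proband heterozygous, father homozygous mutant, mother wild type.
first_array = [0.69, 0.75, 0.76, 0.62, 0.63, 0.69, 0.71, 0.67, 0.62, 0.64, 0.5,
               0.67, 0.89, 0.64, 0.66, 0.82, 0.73, 0.62, 0.65, 0.71, 0.68, 0.62,
               0.71, 0.63, 0.77, 0.99, 0.62, 0.67, 0.66, 0.74, 0.66, 0.67, 0.74,
               0.76, 0.6, 0.6, 0.71, 0.65, 0.6, 0.65, 0.69, 0.65, 0.65, 0.67,
               0.65, 0.6, 0.82, 0.68, 0.6, 0.72, 0.79, 0.71, 0.69]
# Proband heterozygous, father wild type, mother homozygous mutant.
second_array = [0.33, 0.3, 0.33, 0.36, 0.38, 0.26, 0.31, 0.36, 0.3, 0.31, 0.31,
                0.31, 0.28, 0.32, 0.32, 0.23, 0.23, 0.25, 0.32, 0.36, 0.36, 0.35,
                0.31, 0.4, 0.33, 0.33, 0.43, 0.28, 0.28, 0.33, 0.33, 0.34, 0.29,
                0.28, 0.31, 0.28, 0.4, 0.39, 0.36, 0.26, 0.21, 0.2, 0.37, 0.83,
                0.35, 0.39, 0.34, 0.34, 0.22, 0.22, 0.31, 0.28]

d1 = density_peak(first_array)
d2 = density_peak(second_array)
print(d1, d2)
# With thresholds 0.6 and 0.4, d1 above 0.6 and d2 below 0.4 point to the father.
print("father" if d1 > 0.6 and d2 < 0.4 else "undetermined or mother")
```

A grid search is used instead of an optimizer because the ratio is bounded to [0, 1] and a step of 0.001 is far finer than the spread of the data.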
The method for judging the source of a copy number duplication can identify whether the duplicated chromosome fragment came from the father or the mother, so that genetic evidence can assist in predicting the origin of the corresponding disease.
It should be noted that when a CNV detection method finds that a chromosomal segment in the proband's genome is present in three copies, the duplicated segment may affect the function of some of the proband's genes and thereby cause an abnormal phenotype. To better understand such a copy number variation, the relationship between the extra chromosomal segment and the abnormal phenotype needs to be determined. Determining the source of the duplicated segment shows which parent it came from, which improves understanding of the proband's disease and supports in-depth study of its pathogenesis.
Both procedures for determining the source of a fragment copy number variation require at least the raw whole-exome or whole-genome sequencing data of a standard three-person family (trio) that includes the proband; processing these raw data yields the point mutation data used to determine the source of the variation.
In this embodiment, the source detection method for copy number repeat variation provided by the embodiment of the present invention has the following beneficial effects: the invention obtains the variation data of the copy number duplication and determines the related family point mutation data, calculates the point mutation density distribution ratio value according to the mutation types of the family point mutation data and of the variation data, and determines the source of the variation based on that value. The judgment thus conforms more closely to genetic principles, which improves detection precision and accuracy.
The embodiment of the invention also provides a source detection device for copy number repeat variation, and referring to fig. 5, a schematic structure diagram of the source detection device for copy number repeat variation according to an embodiment of the invention is shown.
Wherein, as an example, the source detection device for copy number repetition variation may include:
An acquisition and determination module 501, configured to determine a previous generation mutation type corresponding to the family point mutation data and determine a current generation mutation type corresponding to the mutation data after acquiring mutation data related to copy number repeated mutation and acquiring family point mutation data corresponding to the mutation data;
The distribution ratio calculating module 502 is configured to calculate a point mutation density distribution ratio of the mutation data to the family point mutation data according to the previous generation mutation type and the current generation mutation type;
a determining family source module 503, configured to determine a family source of the mutation data based on the point mutation density distribution ratio value.
Optionally, the point mutation density distribution ratio value includes: a first distribution ratio value;
the module for calculating the distribution proportion value is further used for:
if the previous generation mutation type is a homozygous mutation of the male parent and a wild mutation of the female parent, and the current generation mutation type is a heterozygous mutation, extracting a first point mutation proportion array from the mutation data;
And converting the first point mutation proportion array into a first density distribution curve by using a preset kernel density estimation function, fitting the first density distribution curve by using a preset Gaussian distribution kernel function, and obtaining a distribution proportion value corresponding to the highest point of the first density distribution curve to obtain a first distribution proportion value.
Optionally, the point mutation density distribution ratio value further includes: a second distribution ratio value;
the module for calculating the distribution proportion value is further used for:
if the previous generation mutation type is a wild mutation of a male parent and a homozygous mutation of a female parent, and the current generation mutation type is a heterozygous mutation, extracting a second point mutation proportion array from the mutation data;
And converting the second point mutation proportion array into a second density distribution curve by using a preset kernel density estimation function, fitting the second density distribution curve by using a preset Gaussian distribution kernel function, and obtaining a distribution proportion value corresponding to the highest point of the second density distribution curve to obtain a second distribution proportion value.
Optionally, the determining family source module is further configured to:
if the first distribution proportion value is larger than a first preset value and the second distribution proportion value is smaller than a second preset value, determining that the family source of the variation data is a male parent;
If the first distribution proportion value is smaller than a second preset value and the second distribution proportion value is larger than the first preset value, determining that the family source of the variation data is a female parent.
Optionally, the determining family source module is further configured to:
If the first distribution ratio value and the second distribution ratio value satisfy neither of the above conditions, determining that the family source of the variation data cannot be determined.
Optionally, the preset kernel density estimation function is as follows:
the preset Gaussian distribution kernel function is as follows:
Optionally, the acquiring and determining module is further configured to:
Acquiring a plurality of family sequencing original data corresponding to the variation data, and preprocessing each family sequencing original data to obtain family processing data, wherein the preprocessing comprises the following steps: data cleaning, data quality control, data comparison, mutation detection and data filtering, wherein each family sequencing original data is the gene data of the family member corresponding to the mutation data;
combining the point mutation data contained in the family processing data, and extracting a union set of the combined data to obtain a point mutation data set;
and carrying out data screening on the point mutation data set according to the mutation fragments corresponding to the mutation data to obtain family point mutation data.
It will be clearly understood by those skilled in the art that, for convenience and brevity, the specific working process of the apparatus described above may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
Further, an embodiment of the present application further provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, which when executed implements the source detection method for copy number repeat variation as described in the above embodiments.
Further, an embodiment of the present application also provides a computer-readable storage medium storing a computer-executable program for causing a computer to execute the source detection method for copy number duplication variation as described in the above embodiment.
While the foregoing is directed to the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of the invention, such changes and modifications are also intended to be within the scope of the invention.

Claims (8)

1. A method for detecting a source of repeat variation in copy number, the method comprising:
After obtaining mutation data about repeated mutation of the copy number and obtaining family point mutation data corresponding to the mutation data, determining a previous generation mutation type corresponding to the family point mutation data and determining a current generation mutation type corresponding to the mutation data;
Calculating a point mutation density distribution proportion value of the mutation data to the family point mutation data according to the previous generation mutation type and the current generation mutation type;
determining a family source of the variant data based on the point mutation density distribution ratio value;
The point mutation density distribution ratio value comprises: a first distribution ratio value;
The calculating the point mutation density distribution proportion value of the mutation data to the family point mutation data according to the previous generation mutation type and the current generation mutation type comprises the following steps:
if the previous generation mutation type is a homozygous mutation of the male parent and a wild mutation of the female parent, and the current generation mutation type is a heterozygous mutation, extracting a first point mutation proportion array from the mutation data;
Converting the first point mutation proportion array into a first density distribution curve by using a preset kernel density estimation function, fitting the first density distribution curve by using a preset Gaussian distribution kernel function, and obtaining a distribution proportion value corresponding to the highest point of the first density distribution curve to obtain a first distribution proportion value;
The point mutation density distribution ratio value further includes: a second distribution ratio value;
The calculating the point mutation density distribution proportion value of the mutation data to the family point mutation data according to the previous generation mutation type and the current generation mutation type comprises the following steps:
if the previous generation mutation type is a wild mutation of a male parent and a homozygous mutation of a female parent, and the current generation mutation type is a heterozygous mutation, extracting a second point mutation proportion array from the mutation data;
And converting the second point mutation proportion array into a second density distribution curve by using a preset kernel density estimation function, fitting the second density distribution curve by using a preset Gaussian distribution kernel function, and obtaining a distribution proportion value corresponding to the highest point of the second density distribution curve to obtain a second distribution proportion value.
2. The method for detecting a source of duplicate variation in copy number according to claim 1, wherein said determining a source of said variation data based on said point mutation density distribution ratio value comprises:
if the first distribution proportion value is larger than a first preset value and the second distribution proportion value is smaller than a second preset value, determining that the family source of the variation data is a male parent;
If the first distribution proportion value is smaller than a second preset value and the second distribution proportion value is larger than the first preset value, determining that the family source of the variation data is a female parent.
3. The method for detecting a source of duplicate variation in copy number according to claim 2, wherein said determining a source of said variation data based on said point mutation density distribution ratio value further comprises:
If the first distribution ratio value and the second distribution ratio value satisfy neither of the above conditions, determining that the family source of the variation data cannot be determined.
4. The method of claim 1, wherein the predetermined kernel density estimation function is as follows:
the preset Gaussian distribution kernel function is as follows:
5. A method for detecting a source of repetitive variation in copy number according to any one of claims 1 to 3, wherein said obtaining family point mutation data corresponding to said variation data comprises:
Acquiring a plurality of family sequencing original data corresponding to the variation data, and preprocessing each family sequencing original data to obtain family processing data, wherein the preprocessing comprises the following steps: data cleaning, data quality control, data comparison, mutation detection and data filtering, wherein each family sequencing original data is the gene data of the family member corresponding to the mutation data;
combining the point mutation data contained in the family processing data, and extracting a union set of the combined data to obtain a point mutation data set;
and carrying out data screening on the point mutation data set according to the mutation fragments corresponding to the mutation data to obtain family point mutation data.
6. A source detection device for copy number repeat variation, the device comprising:
The acquisition and determination module is used for determining the previous generation mutation type corresponding to the family point mutation data and determining the current generation mutation type corresponding to the mutation data after acquiring the mutation data related to the repeated mutation of the copy number and acquiring the family point mutation data corresponding to the mutation data;
The distribution proportion value calculating module is used for calculating the point mutation density distribution proportion value of the mutation data accounting for the family point mutation data according to the previous generation mutation type and the current generation mutation type;
A determining family source module for determining family sources of the mutation data based on the point mutation density distribution ratio values;
The point mutation density distribution ratio value comprises: a first distribution ratio value;
The calculating the point mutation density distribution proportion value of the mutation data to the family point mutation data according to the previous generation mutation type and the current generation mutation type comprises the following steps:
if the previous generation mutation type is a homozygous mutation of the male parent and a wild mutation of the female parent, and the current generation mutation type is a heterozygous mutation, extracting a first point mutation proportion array from the mutation data;
Converting the first point mutation proportion array into a first density distribution curve by using a preset kernel density estimation function, fitting the first density distribution curve by using a preset Gaussian distribution kernel function, and obtaining a distribution proportion value corresponding to the highest point of the first density distribution curve to obtain a first distribution proportion value;
The point mutation density distribution ratio value further includes: a second distribution ratio value;
The calculating the point mutation density distribution proportion value of the mutation data to the family point mutation data according to the previous generation mutation type and the current generation mutation type comprises the following steps:
if the previous generation mutation type is a wild mutation of a male parent and a homozygous mutation of a female parent, and the current generation mutation type is a heterozygous mutation, extracting a second point mutation proportion array from the mutation data;
And converting the second point mutation proportion array into a second density distribution curve by using a preset kernel density estimation function, fitting the second density distribution curve by using a preset Gaussian distribution kernel function, and obtaining a distribution proportion value corresponding to the highest point of the second density distribution curve to obtain a second distribution proportion value.
7. An electronic device, comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the source detection method for copy number repeat variation as claimed in any one of claims 1 to 5 when the program is executed.
8. A computer-readable storage medium storing a computer-executable program for causing a computer to execute the source detection method for copy number duplication variation of any one of claims 1 to 5.
CN202310851930.7A 2023-07-12 2023-07-12 Source detection method and device for copy number repeat variation Active CN116935961B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310851930.7A CN116935961B (en) 2023-07-12 2023-07-12 Source detection method and device for copy number repeat variation


Publications (2)

Publication Number Publication Date
CN116935961A CN116935961A (en) 2023-10-24
CN116935961B CN116935961B (en) 2024-06-25

Family

ID=88382035

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310851930.7A Active CN116935961B (en) 2023-07-12 2023-07-12 Source detection method and device for copy number repeat variation

Country Status (1)

Country Link
CN (1) CN116935961B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109785899A (en) * 2019-02-18 2019-05-21 东莞博奥木华基因科技有限公司 A kind of device and method of genotype correction
CN115433777A (en) * 2022-10-26 2022-12-06 北京中仪康卫医疗器械有限公司 Integrated identification method for CNV, SV and SGD abnormalities and abnormal sources of embryos

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3658687A1 (en) * 2017-07-25 2020-06-03 Sophia Genetics S.A. Methods for detecting biallelic loss of function in next-generation sequencing genomic data
CN111341383B (en) * 2020-03-17 2021-06-29 安吉康尔(深圳)科技有限公司 Method, device and storage medium for detecting copy number variation


Also Published As

Publication number Publication date
CN116935961A (en) 2023-10-24


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant