CN111415704A - STR gene data analysis method - Google Patents
STR gene data analysis method
- Publication number
- CN111415704A (application number CN202010418467.3A)
- Authority
- CN
- China
- Prior art keywords
- peak
- data
- gene
- namely
- analysis
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Landscapes
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Medical Informatics (AREA)
- General Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Biophysics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Data Mining & Analysis (AREA)
- Genetics & Genomics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Molecular Biology (AREA)
- Databases & Information Systems (AREA)
- Artificial Intelligence (AREA)
- Bioethics (AREA)
- Epidemiology (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Public Health (AREA)
- Software Systems (AREA)
- Investigating Or Analysing Biological Materials (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
- Apparatus Associated With Microorganisms And Enzymes (AREA)
Abstract
The invention discloses a method for analyzing STR gene data. Implemented in C++, it performs fully automatic positional analysis of sequencing data for short tandem repeat (STR) sequences in DNA, improving overall data-processing efficiency, ultimately identifying the gene regions associated with specific traits, and providing a rich set of charts for displaying the positioning results. The algorithm supports analysis of multicolor fluorescence channels; supports the kits commonly used on the market; supports the raw data file formats output by common sequencers; automatically identifies and removes spurious peaks generated during testing; and automatically calibrates peak misalignment arising during sequencing. The invention achieves fully automatic analysis of multicolor fluorescence channel data for STR sequences in forensic DNA testing. It ends the reliance on foreign software for DNA sequencing analysis, lays a foundation for developing domestic sequencers and reagents in China, and accelerates localization in the field of forensic DNA testing in China.
Description
Technical Field
The invention relates to the technical field of STR gene data analysis, and in particular to a method for analyzing STR gene data.
Background
Short tandem repeats (STRs), a form of microsatellite DNA, are a class of DNA polymorphism found widely across the 23 pairs of chromosomes of the human genome. An STR generally consists of repeat units of 2-6 bp, and the difference between individuals (or alleles) usually appears only as a difference in the number of repeats. STR loci are important genetic markers: highly polymorphic loci have strong power of individual identification and can be detected simply by PCR. STR loci have the following characteristics: there are many polymorphic loci, and fragment lengths are generally under 400 bp, so they amplify easily and are suitable for trace samples; alleles are close in size, so preferential amplification of the smaller allele is not pronounced; and fragment lengths differ little between loci, which facilitates multiplex amplification and improves detection efficiency. In addition, STR analysis is simple, which facilitates standardization and automation of DNA typing in the laboratory, as well as computer storage and networked retrieval of typing data. STR typing is currently the most widely and most frequently used DNA typing technology in forensic medicine.
At present, STR analysis usually requires specialized software to read and parse the raw data generated by a sequencer. Reading, parsing, and converting the data requires complex algorithms that follow gene-specific rules and mathematical formulas. Analysis of the raw data currently relies on foreign software and is limited to 5-color fluorescence channels. Data from 8-color and 9-color fluorescence channels cannot be analyzed effectively with domestic tools, which holds back forensic DNA identification in China as a whole.
The prior art has the following defects:
1. Existing domestic methods for analyzing 8-color and 9-color STR raw data suffer from missed peaks, wrong peaks, and misjudgments, so accurate analysis cannot be performed, which greatly hinders domestic DNA sequencing.
2. Not all kits commonly available on the market are fully supported; the rules that combine the analyzed fragments with the internal standard must be adjusted separately for each kit.
3. The judgment methods built into existing analysis software target only 5-color fluorescence channels. As reagent sensitivity keeps improving, the number of detectable loci has grown rapidly and 5 color channels no longer suffice, so analysis that supports more fluorescence channels is needed.
4. Current methods require some peak types to be judged manually during analysis, and only experimenters with extensive analysis experience can judge them accurately, so the barrier to entry is high and the methods cannot be widely adopted.
Disclosure of Invention
The present invention aims to provide a method for analyzing STR gene data to solve the problems in the background art.
A method for STR gene data analysis comprises the following steps:
Step 1: Data collection. Raw data files in .fsa and .hid format output by the sequencer are placed under a designated path; the system refreshes in real time and reads each new data file as soon as it is detected under that path;
Step 2: Raw data parsing. The header data of the raw file is read, and the offset and data length of each dye channel's data are determined from the type and description of each data packet defined in the header; a Fourier series formula is then used to form a sine-cosine waveform peak map in the time and frequency directions;
Step 3: Fluorescence color separation. The internal standard data in the run data serves as the absolute positioning reference for the raw map, and the locus configuration file provided by the manufacturer serves as its logical position reference; on this basis, each fluorescence channel is separated by color;
Step 4: Peak calculation. Normal and abnormal peaks are distinguished automatically according to the range of each locus and the high-to-low peak threshold ratio (that is, the peak height difference ratio of a heterozygote), by checking whether the peak position lies within the locus range and whether the peak height ratio exceeds the threshold;
Step 5: Comparison against the allelic ladder standard. The typing standard (ladder) provided by the reagent manufacturer is used as the reference for measuring peaks; the sample's peak height and its position within the locus are compared and calibrated according to the sample peak position and the corresponding ladder position;
Step 6: Allele assignment by locus rules. The configuration file provided by the reagent manufacturer predefines the number of loci and the loci assigned to each fluorescence color channel, and the system automatically compares and matches alleles to loci according to these rules;
Step 7: Spurious peak filtering. The attributes of a spurious peak are preset: a peak height below 50-60 for homozygotes and heterozygotes and a peak width between 0.3 and 0.7. A peak that falls within these ranges is judged to be a spurious peak, and a peak that exceeds them is judged to be a valid peak;
Step 8: Display of allele peak profiles, to facilitate identification and judgment by the system and by laboratory technicians.
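The peak checks in steps 4 and 7 can be illustrated with a short C++ sketch. It is only one possible reading of the text, not the patented implementation: the `Peak` and `LocusRange` types, the function names, and the definition of the difference ratio as (taller - shorter) / taller are all assumptions introduced here, and the thresholds are passed in rather than fixed.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical peak record; field names are illustrative, not from the source.
struct Peak {
    double sizeBp;  // fragment size in base pairs
    double height;  // peak height (instrument units)
    double width;   // peak width (same unit as the preset width range)
};

// Hypothetical locus window, as it might be read from a kit configuration file.
struct LocusRange {
    double minBp;
    double maxBp;
};

enum class PeakStatus { Normal, OutOfLocus, Imbalanced };

// Step 4, first check: does the peak position lie inside the locus range?
bool inLocus(const Peak& p, const LocusRange& r) {
    return p.sizeBp >= r.minBp && p.sizeBp <= r.maxBp;
}

// Assumed definition: height difference relative to the taller peak, in [0, 1).
double heightDifferenceRatio(const Peak& a, const Peak& b) {
    assert(a.height > 0.0 && b.height > 0.0);
    const double hi = std::max(a.height, b.height);
    const double lo = std::min(a.height, b.height);
    return (hi - lo) / hi;
}

// Step 4, combined: a heterozygote pair is abnormal if either peak is outside
// the locus range or the difference ratio exceeds the configured threshold.
PeakStatus classifyPair(const Peak& a, const Peak& b,
                        const LocusRange& r, double maxDifferenceRatio) {
    if (!inLocus(a, r) || !inLocus(b, r)) return PeakStatus::OutOfLocus;
    if (heightDifferenceRatio(a, b) > maxDifferenceRatio) return PeakStatus::Imbalanced;
    return PeakStatus::Normal;
}

// Step 7, under the assumption that a spurious peak is one that is both
// below the height cutoff and inside the preset width range.
bool isSpurious(const Peak& p, double heightCutoff,
                double widthMin, double widthMax) {
    return p.height < heightCutoff && p.width >= widthMin && p.width <= widthMax;
}
```

Keeping the thresholds as parameters reflects the text's statement that they are preset per locus or per kit; a height cutoff of 50 and a width range of 0.3 to 0.7 would correspond to the figures given in step 7.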
In step 3 above, the internal standard data is provided by the reagent manufacturer and serves as the reference against which real samples are analyzed.
Compared with the prior art, the invention has the following beneficial effects: it achieves fully automatic analysis of multicolor fluorescence channel data for short tandem repeat (STR) sequences in forensic DNA testing. It ends the reliance on foreign software for DNA sequencing analysis, lays a foundation for developing domestic sequencers and reagents in China, and accelerates localization in the field of forensic DNA testing in China.
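As an illustration of how an internal standard can act as an absolute positioning reference, the C++ sketch below converts a sample peak's scan position into a fragment size by linear interpolation between the two flanking internal-standard peaks. The text does not specify the sizing formula, so linear interpolation is an assumption chosen for simplicity, and `StandardPeak` and `sizeFromStandard` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// One internal-standard fragment: where it was observed and its known size.
struct StandardPeak {
    double scan;    // observed position on the time axis (scan number)
    double sizeBp;  // known fragment size in base pairs
};

// Interpolates linearly between the two standard peaks that flank `scan`.
// `standard` must be sorted by strictly increasing scan position.
// Returns std::nullopt when `scan` lies outside the standard's span.
std::optional<double> sizeFromStandard(const std::vector<StandardPeak>& standard,
                                       double scan) {
    assert(standard.size() >= 2);
    for (std::size_t i = 0; i + 1 < standard.size(); ++i) {
        const StandardPeak& lo = standard[i];
        const StandardPeak& hi = standard[i + 1];
        if (scan >= lo.scan && scan <= hi.scan) {
            const double t = (scan - lo.scan) / (hi.scan - lo.scan);
            return lo.sizeBp + t * (hi.sizeBp - lo.sizeBp);
        }
    }
    return std::nullopt;
}
```

Returning an empty optional for positions outside the standard avoids extrapolating beyond the reference fragments, where a sizing estimate would be least reliable.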
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a diagram of a ladder showing the rule standard of the present invention.
FIG. 3 is a drawing showing the typing standard of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention is implemented in C++ and performs fully automatic positional analysis of sequencing data for short tandem repeat (STR) sequences in DNA, improving overall data-processing efficiency, ultimately identifying the gene regions associated with specific traits, and providing a rich set of charts for displaying the positioning results. The algorithm supports analysis of multicolor fluorescence channels; supports the kits commonly used on the market; supports the raw data file formats output by common sequencers; automatically identifies and removes spurious peaks generated during testing; and automatically calibrates peak misalignment arising during sequencing.
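Support for sequencer raw files depends on reading header records that give the offset and length of each data item. The C++ sketch below decodes one such record, assuming the ABIF container layout that .fsa and .hid files are generally understood to use: a 28-byte big-endian directory entry. The patent does not name the container format, so this layout is an assumption, and the `DirEntry` struct and function names are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Assumed ABIF-style directory entry (28 bytes on disk, big-endian).
struct DirEntry {
    std::string  name;         // 4-character tag name
    std::int32_t number;       // tag number
    std::int16_t elementType;  // element type code
    std::int16_t elementSize;  // size of one element in bytes
    std::int32_t numElements;  // number of elements in the item
    std::int32_t dataSize;     // total size of the item in bytes
    std::int32_t dataOffset;   // offset of the item's data in the file
};

constexpr std::size_t kDirEntryBytes = 28;

// Reads a big-endian 32-bit value from four consecutive bytes.
std::uint32_t readBE32(const unsigned char* p) {
    return (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
           (std::uint32_t(p[2]) << 8)  |  std::uint32_t(p[3]);
}

// Reads a big-endian 16-bit value from two consecutive bytes.
std::uint16_t readBE16(const unsigned char* p) {
    return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}

// Decodes one directory entry; the trailing 4 reserved bytes are skipped.
DirEntry parseDirEntry(const unsigned char* buf, std::size_t len) {
    assert(buf != nullptr && len >= kDirEntryBytes);
    DirEntry e;
    e.name.assign(reinterpret_cast<const char*>(buf), 4);
    e.number      = static_cast<std::int32_t>(readBE32(buf + 4));
    e.elementType = static_cast<std::int16_t>(readBE16(buf + 8));
    e.elementSize = static_cast<std::int16_t>(readBE16(buf + 10));
    e.numElements = static_cast<std::int32_t>(readBE32(buf + 12));
    e.dataSize    = static_cast<std::int32_t>(readBE32(buf + 16));
    e.dataOffset  = static_cast<std::int32_t>(readBE32(buf + 20));
    return e;
}
```

Decoding byte by byte rather than casting the buffer to a struct keeps the sketch independent of host endianness and struct padding.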
Referring to FIGS. 1-3, the present invention provides a method for analyzing STR gene data, comprising the following steps:
Step 1: Data collection. Raw data files in .fsa and .hid format output by the sequencer are placed under a designated path; the system refreshes in real time and reads each new data file as soon as it is detected under that path;
Step 2: Raw data parsing. The header data of the raw file is read, and the offset and data length of each dye channel's data are determined from the type and description of each data packet defined in the header; a Fourier series formula is then used to form a sine-cosine waveform peak map in the time and frequency directions;
Step 3: Fluorescence color separation. The internal standard data in the run data (provided by the reagent manufacturer as the reference against which real samples are analyzed) serves as the absolute positioning reference for the raw map, and the locus configuration file provided by the manufacturer serves as its logical position reference; each fluorescence channel is then separated by color according to the determined logical position of the internal standard;
Step 4: Peak calculation. Normal and abnormal peaks are distinguished automatically according to the range of each locus and the high-to-low peak threshold ratio (the peak height difference ratio of a heterozygote), by checking whether the peak position lies within the locus range and whether the peak height ratio exceeds the threshold;
Step 5: Comparison against the allelic ladder standard. The typing standard (ladder) provided by the reagent manufacturer is used as the reference for measuring peaks; the sample's peak height and its position within the locus are compared and calibrated according to the sample peak position and the corresponding ladder position;
Step 6: Allele assignment by locus rules. The configuration file provided by the reagent manufacturer predefines the number of loci and the loci assigned to each fluorescence color channel, and the system automatically compares and matches alleles to loci according to these rules;
Step 7: Spurious peak filtering. The attributes of a spurious peak are preset: a peak height below 50-60 for homozygotes and heterozygotes and a peak width between 0.3 and 0.7. A peak that falls within these ranges is judged to be a spurious peak, and a peak that exceeds them is judged to be a valid peak;
Step 8: Display of allele peak profiles, to facilitate identification and judgment by the system and by laboratory technicians.
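The ladder comparison and allele assignment of steps 5 and 6 can be sketched in C++ as a nearest-bin lookup. This is a simplified illustration under stated assumptions: the `LadderBin` type, the `callAllele` function, and the use of a fixed size tolerance are introduced here and are not taken from the patent; a real kit configuration would supply the bins per locus and per dye channel.

```cpp
#include <cassert>
#include <cmath>
#include <optional>
#include <string>
#include <vector>

// Hypothetical ladder entry for one locus: allele label and its measured size.
struct LadderBin {
    std::string allele;  // allele designation, e.g. a repeat count
    double sizeBp;       // size of the ladder peak in base pairs
};

// Returns the label of the ladder bin closest to the sample peak, provided
// it lies within `toleranceBp`; otherwise returns std::nullopt so that the
// peak can be flagged as off-ladder for review.
std::optional<std::string> callAllele(const std::vector<LadderBin>& ladder,
                                      double sampleSizeBp,
                                      double toleranceBp) {
    assert(toleranceBp >= 0.0);
    const LadderBin* best = nullptr;
    double bestDistance = toleranceBp;
    for (const LadderBin& bin : ladder) {
        const double distance = std::fabs(bin.sizeBp - sampleSizeBp);
        if (distance <= bestDistance) {
            best = &bin;
            bestDistance = distance;
        }
    }
    if (best == nullptr) return std::nullopt;
    return best->allele;
}
```

Running the same lookup once per locus, with the ladder bins grouped by fluorescence channel, would correspond to the per-channel locus rules described in step 6.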
Compared with the prior art, the invention has the following beneficial effects: it achieves fully automatic analysis of multicolor fluorescence channel data for short tandem repeat (STR) sequences in forensic DNA testing. It ends the reliance on foreign software for DNA sequencing analysis, lays a foundation for developing domestic sequencers and reagents in China, and accelerates localization in the field of forensic DNA testing in China.
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (2)
1. A method for analyzing STR gene data, characterized by comprising the following steps:
Step 1: Data collection. Raw data files in .fsa and .hid format output by the sequencer are placed under a designated path; the system refreshes in real time and reads each new data file as soon as it is detected under that path;
Step 2: Raw data parsing. The header data of the raw file is read, and the offset and data length of each dye channel's data are determined from the type and description of each data packet defined in the header; a Fourier series formula is then used to form a sine-cosine waveform peak map in the time and frequency directions;
Step 3: Fluorescence color separation. The internal standard data in the run data serves as the absolute positioning reference for the raw map, and the locus configuration file provided by the manufacturer serves as its logical position reference; on this basis, each fluorescence channel is separated by color;
Step 4: Peak calculation. Normal and abnormal peaks are distinguished automatically according to the range of each locus and the high-to-low peak threshold ratio (that is, the peak height difference ratio of a heterozygote), by checking whether the peak position lies within the locus range and whether the peak height ratio exceeds the threshold;
Step 5: Comparison against the allelic ladder standard. The typing standard (ladder) provided by the reagent manufacturer is used as the reference for measuring peaks; the sample's peak height and its position within the locus are compared and calibrated according to the sample peak position and the corresponding ladder position;
Step 6: Allele assignment by locus rules. The configuration file provided by the reagent manufacturer predefines the number of loci and the loci assigned to each fluorescence color channel, and the system automatically compares and matches alleles to loci according to these rules;
Step 7: Spurious peak filtering. The attributes of a spurious peak are preset: a peak height below 50-60 for homozygotes and heterozygotes and a peak width between 0.3 and 0.7. A peak that falls within these ranges is judged to be a spurious peak, and a peak that exceeds them is judged to be a valid peak;
Step 8: Display of allele peak profiles, to facilitate identification and judgment by the system and by laboratory technicians.
2. The method of claim 1, wherein in step 3, the internal standard data is provided by a reagent manufacturer and is a reference for analyzing real samples with respect to the standard.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010418467.3A CN111415704B (en) | 2020-05-18 | 2020-05-18 | STR gene data analysis method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010418467.3A CN111415704B (en) | 2020-05-18 | 2020-05-18 | STR gene data analysis method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111415704A (en) | 2020-07-14 |
CN111415704B (en) | 2021-05-18 |
Family
ID=71494978
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010418467.3A Active CN111415704B (en) | 2020-05-18 | 2020-05-18 | STR gene data analysis method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111415704B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112185468A (en) * | 2020-12-01 | 2021-01-05 | 南京溯远基因科技有限公司 | Cloud management system and method for gene data analysis and processing |
CN112786107A (en) * | 2021-01-20 | 2021-05-11 | 深圳百人科技有限公司 | Analysis method for multiplex amplification STR data |
CN113793644A (en) * | 2021-09-15 | 2021-12-14 | 宁波海尔施基因科技有限公司 | Quality evaluation method of DNA detection data |
CN114373507A (en) * | 2022-01-27 | 2022-04-19 | 中国科学院北京基因组研究所(国家生物信息中心) | Analysis method of mixed DNA map |
CN115346604A (en) * | 2022-10-20 | 2022-11-15 | 百特元生物科技(北京)有限公司 | DNA sample equilibrium analysis method and device |
CN115346607A (en) * | 2022-10-20 | 2022-11-15 | 百特元生物科技(北京)有限公司 | DNA sample duplication checking method and device |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070202526A1 (en) * | 2006-02-28 | 2007-08-30 | Hitachi Software Engineering Co., Ltd. | Genotyping result evaluation method and system |
CN108052797A (en) * | 2017-12-28 | 2018-05-18 | 上海嘉因生物科技有限公司 | Detection method applied to Binding site for transcription factor on chromosome in tissue samples |
CN109712673A (en) * | 2018-12-24 | 2019-05-03 | 江苏师范大学 | A kind of method of quick export Terminal restriction fragment length polymorphism data |
CN110289048A (en) * | 2019-07-05 | 2019-09-27 | 广西壮族自治区水牛研究所 | QTL relevant to buffalo milk production trait and its screening technique and application |
US10504614B2 (en) * | 2013-10-07 | 2019-12-10 | Rutgers, The State University Of New Jersey | Systems and methods for determining an unknown characteristic of a sample |
CN110706746A (en) * | 2019-11-27 | 2020-01-17 | 北京博安智联科技有限公司 | DNA mixed typing database comparison algorithm |
Non-Patent Citations (2)
Title |
---|
SHOTA INOKUCHI 等: "Non-specific peaks generated by animal DNA during human STR analysis:Peak characteristics and a novel analysis method for mixed human/animal samples", 《FORENSIC SCIENCE INTERNATIONAL:GENETICS 》 * |
贾二惠 等: "基于动态规划的DNA碱基识别峰匹配方法的设计与实现", 《分析仪器》 * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112185468A (en) * | 2020-12-01 | 2021-01-05 | 南京溯远基因科技有限公司 | Cloud management system and method for gene data analysis and processing |
CN112786107A (en) * | 2021-01-20 | 2021-05-11 | 深圳百人科技有限公司 | Analysis method for multiplex amplification STR data |
CN113793644A (en) * | 2021-09-15 | 2021-12-14 | 宁波海尔施基因科技有限公司 | Quality evaluation method of DNA detection data |
CN114373507A (en) * | 2022-01-27 | 2022-04-19 | 中国科学院北京基因组研究所(国家生物信息中心) | Analysis method of mixed DNA map |
CN114373507B (en) * | 2022-01-27 | 2022-07-05 | 中国科学院北京基因组研究所(国家生物信息中心) | Analysis method of mixed DNA map |
CN115346604A (en) * | 2022-10-20 | 2022-11-15 | 百特元生物科技(北京)有限公司 | DNA sample equilibrium analysis method and device |
CN115346607A (en) * | 2022-10-20 | 2022-11-15 | 百特元生物科技(北京)有限公司 | DNA sample duplication checking method and device |
CN115346604B (en) * | 2022-10-20 | 2023-02-10 | 百特元生物科技(北京)有限公司 | DNA sample equilibrium analysis method and device |
CN115346607B (en) * | 2022-10-20 | 2023-02-10 | 百特元生物科技(北京)有限公司 | DNA sample duplication checking method and device |
Also Published As
Publication number | Publication date |
---|---|
CN111415704B (en) | 2021-05-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111415704B (en) | STR gene data analysis method | |
CN113744807B (en) | Macrogenomics-based pathogenic microorganism detection method and device | |
CN109346130B (en) | Method for directly obtaining micro-haplotype from whole genome re-sequencing data and typing micro-haplotype | |
CN110993029B (en) | Method and system for detecting chromosome abnormality | |
CN111052249B (en) | Methods of determining predetermined chromosome conservation regions, methods of determining whether copy number variation exists in a sample genome, systems, and computer readable media | |
CN113913539B (en) | Molecular marker related to chicken skin yellowness character and application thereof | |
CN110021346B (en) | Gene fusion and mutation detection method and system based on RNAseq data | |
CN112289384B (en) | Construction method and application of citrus whole genome KASP marker library | |
CN111808983B (en) | Rubber tree variety standard DNA fingerprint spectrum library and construction method and special primer thereof | |
CN116030892B (en) | System and method for identifying chromosome reciprocal translocation breakpoint position | |
CN110846429A (en) | Corn whole genome InDel chip and application thereof | |
CN110444253B (en) | Method and system suitable for mixed pool gene positioning | |
CN108220473B (en) | Identification of maize S-type cytoplasmic male sterile material by using chloroplast InDel marker | |
LU502479B1 (en) | Group of snp loci and method for identifying biogeographic origins of east asian populations | |
CN115141893B (en) | Molecular marker group containing 7 molecular markers for predicting dry matter content of kiwi fruits, application of molecular marker group and kit | |
CN114530200B (en) | Mixed sample identification method based on calculation of SNP entropy | |
CN110016498B (en) | Method for determining single nucleotide polymorphism in Sanger method sequencing | |
Mattocks et al. | Comparative sequence analysis | |
CN114790493B (en) | MNP (MNP) marking site of herpes simplex virus, primer composition, kit and application of MNP marking site | |
US20240315184A1 (en) | Snp molecular marker combination of brassica napus l. and application method thereof | |
CN118638935A (en) | SNP molecular marker closely linked with feather color of Jia Ji duck and application thereof | |
CN115029453A (en) | MNP (protein-protein) marker site of streptococcus pyogenes, primer composition, kit and application of MNP marker site | |
CN115029454A (en) | MNP (MNP) marker locus of Moraxella catarrhalis, primer composition, kit and application thereof | |
CN117965779A (en) | Application of SNP molecular marker for detecting purity of pepper variety, KASP primer combination and application thereof | |
CN114410829A (en) | Molecular marker related to color of purple cauliflower and application thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||