CN111415704A

CN111415704A - STR gene data analysis method

Info

Publication number: CN111415704A
Application number: CN202010418467.3A
Authority: CN
Inventors: 秦叶
Original assignee: Beijing Boanzhilian Technology Co ltd
Current assignee: Beijing Boanzhilian Technology Co ltd
Priority date: 2020-05-18
Filing date: 2020-05-18
Publication date: 2020-07-14
Anticipated expiration: 2040-05-18
Also published as: CN111415704B

Abstract

The invention discloses a method for analyzing STR gene data. Implemented in C++, the method performs fully automatic positional analysis of short tandem repeat (STR) sequencing data from DNA, which improves overall data processing efficiency, identifies the gene regions associated with specific traits, and provides a rich set of charts for displaying the positioning results. The algorithm supports analysis of multicolor fluorescence channels, supports the kits commonly used on the market, supports the raw data file formats output by common sequencers, automatically identifies and removes stray peaks generated during testing, and automatically calibrates the peak shifts that arise during sequencing. The invention thereby achieves fully automatic analysis of multicolor fluorescence channel data for STRs in forensic DNA testing. It ends the reliance on foreign software for DNA sequencing analysis, lays a foundation for developing domestic sequencers and reagents in China, and accelerates localization in the field of forensic DNA testing in China.

Description

STR gene data analysis method

Technical Field

The invention relates to the technical field of STR gene data, in particular to a method for analyzing STR gene data.

Background

The short tandem repeat in microsatellite DNA, abbreviated STR, is a DNA polymorphism that is widely distributed across the 23 pairs of chromosomes of the human genome. An STR generally consists of repeat units of 2-6 bp, and the difference between individuals (or alleles) generally appears only as a difference in the number of repeats. STR loci are important genetic markers: highly polymorphic loci have strong power to discriminate between individuals and can be detected simply by PCR. STR loci have the following characteristics: there are many polymorphic loci; fragment lengths are generally below 400 bp, so they amplify easily and are suitable for trace samples; allele sizes are relatively close, so preferential amplification of the smaller allele is not pronounced; and fragment lengths differ little between loci, which facilitates multiplex amplification and improves detection efficiency. In addition, STR locus analysis is simple, which facilitates the standardization and automation of laboratory DNA typing as well as the computer storage and networked retrieval of typing data. STR typing is currently the most widely and most frequently used DNA typing technology in forensic medicine.

At present, STR analysis usually requires specialized software to read and parse the raw data generated by a sequencer. Reading, parsing and converting the data requires complex calculations based on gene-specific rules and mathematical formulas. Analysis of the raw data currently relies on foreign software and is limited to 5-color fluorescence channels. Data from 8-color and 9-color fluorescence channels cannot be analyzed effectively with domestic tools, which has held back forensic DNA identification in China as a whole.

The prior art has the following defects. 1. Because of shortcomings in the analysis method, existing domestic analysis of 8-color and 9-color STR raw data suffers from missing peaks, wrong peaks and misjudgments, so accurate analysis cannot be carried out, which greatly hinders domestic DNA sequencing. 2. The kits commonly available on the market are not all fully compatible, and the internal standard rules for combining analysis fragments must be adjusted separately for each kit. 3. The judgment methods built into existing analysis software target only 5-color fluorescence channels; as reagent sensitivity keeps improving, the number of detectable loci is growing rapidly and 5-color channels are no longer sufficient, so analysis that supports more fluorescence channels is needed. 4. The analysis methods currently in use require some peak types to be judged manually, and only experimenters with extensive analysis experience can judge them accurately, so the barrier to entry is high and the methods cannot be widely adopted.

Disclosure of Invention

The present invention aims to provide a method for analyzing STR gene data to solve the problems in the background art.

A method for STR gene data analysis comprises the following steps:

Step 1: data collection. Raw data files in .fsa and .hid format output by the sequencer are placed under a designated path; the system refreshes in real time and, when it detects a new data file under that path, reads it;

Step 2: raw data parsing. The header data of the raw data file is read, and the offset and data length of each dye channel's data are determined from the type and description of each data packet defined in the header; a sine-cosine waveform peak map in the time and frequency directions is then formed using the Fourier series formula;

Step 3: fluorescence color separation. Each fluorescence channel is separated by color, using the internal standard data in the data as the absolute positioning reference of the raw map and the locus configuration file provided by the manufacturer as its logical reference;

Step 4: peak calculation. Normal and abnormal peaks are distinguished automatically from each locus range and the high-to-low peak threshold ratio (the peak height ratio of a heterozygote), by checking whether the peak position lies within the locus range and whether the peak height ratio exceeds the threshold;

Step 5: comparison against the ladder (typing standard). The typing standard provided by the reagent manufacturer serves as the reference for measuring peaks; the sample's peak heights and its positions within each locus are compared and calibrated against the ladder positions;

Step 6: allele assignment according to locus rules. The configuration file provided by the reagent manufacturer predefines the number of loci and the loci assigned to each fluorescence channel, and the system automatically matches alleles to loci according to these rules;

Step 7: stray-peak (mixed-peak) filtering. The attributes of a stray peak are preset: for homozygotes and heterozygotes, a peak whose height is below 50-60 and whose width is between 0.3 and 0.7 is judged to be a stray peak, and a peak outside that range is judged to be a valid peak;

Step 8: display of the allele peak profile, to facilitate identification and judgment by the system and by laboratory technicians.
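The header parsing described in step 2 can be illustrated with a minimal sketch. It assumes a 28-byte, big-endian directory record of the kind used by the ABIF container format (tag name, tag number, element type, element size, element count, data size, data offset, reserved handle). That layout, the DirEntry struct and the helper names readBE16, readBE32 and parseDirEntry are assumptions made for illustration and are not taken from the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical view of one directory record. The 28-byte big-endian layout
// is an assumption modeled on ABIF, not something the text specifies.
struct DirEntry {
    std::string  name;         // 4-character tag, e.g. "DATA"
    std::int32_t number;       // tag number
    std::int16_t elementType;  // type code of each element
    std::int16_t elementSize;  // bytes per element
    std::int32_t numElements;  // element count
    std::int32_t dataSize;     // data length in bytes
    std::int32_t dataOffset;   // displacement of the data from file start
};

// Read a 16-bit big-endian value starting at pos.
std::uint16_t readBE16(const std::vector<std::uint8_t>& b, std::size_t pos) {
    assert(pos + 2 <= b.size());
    return static_cast<std::uint16_t>((b[pos] << 8) | b[pos + 1]);
}

// Read a 32-bit big-endian value starting at pos.
std::uint32_t readBE32(const std::vector<std::uint8_t>& b, std::size_t pos) {
    assert(pos + 4 <= b.size());
    return (static_cast<std::uint32_t>(b[pos]) << 24) |
           (static_cast<std::uint32_t>(b[pos + 1]) << 16) |
           (static_cast<std::uint32_t>(b[pos + 2]) << 8) |
            static_cast<std::uint32_t>(b[pos + 3]);
}

// Decode the record that starts at byte pos of buf.
DirEntry parseDirEntry(const std::vector<std::uint8_t>& buf, std::size_t pos) {
    const std::size_t kRecordSize = 28;
    if (pos > buf.size() || buf.size() - pos < kRecordSize) {
        throw std::out_of_range("truncated directory record");
    }
    DirEntry e;
    e.name.assign(reinterpret_cast<const char*>(buf.data() + pos), 4);
    e.number      = static_cast<std::int32_t>(readBE32(buf, pos + 4));
    e.elementType = static_cast<std::int16_t>(readBE16(buf, pos + 8));
    e.elementSize = static_cast<std::int16_t>(readBE16(buf, pos + 10));
    e.numElements = static_cast<std::int32_t>(readBE32(buf, pos + 12));
    e.dataSize    = static_cast<std::int32_t>(readBE32(buf, pos + 16));
    e.dataOffset  = static_cast<std::int32_t>(readBE32(buf, pos + 20));
    // The last 4 bytes (pos + 24) are treated as reserved and ignored.
    return e;
}
```

With a record decoded this way, a reader could seek to dataOffset and read dataSize bytes to obtain one channel's trace; how the actual implementation walks the header is not described in the text.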

In step 3 above, the internal standard data is provided by the reagent manufacturer and serves as the reference against which real samples are analyzed.
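The ladder comparison and allele assignment of steps 5 and 6 can be sketched as a nearest-match lookup. This is only an illustrative reading: the LadderAllele struct, the assignAllele function, the 0.5 bp default tolerance and the "OL" (off-ladder) label are all assumptions, since the text does not state how close a sample peak must be to a ladder position.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

// One rung of the typing standard for a single locus (hypothetical type).
struct LadderAllele {
    std::string label;   // allele name, e.g. "10"
    double      sizeBp;  // calibrated fragment size of the ladder peak
};

// Return the label of the ladder allele closest to the sample peak, provided
// it lies within toleranceBp; otherwise return "OL" (off-ladder).
// The tolerance value is an assumption, not a figure from the text.
std::string assignAllele(const std::vector<LadderAllele>& ladder,
                         double peakSizeBp,
                         double toleranceBp = 0.5) {
    assert(toleranceBp >= 0.0);
    const LadderAllele* best = nullptr;
    double bestDist = toleranceBp;
    for (const LadderAllele& a : ladder) {
        const double d = std::fabs(a.sizeBp - peakSizeBp);
        if (d <= bestDist) {  // keep the closest rung seen so far
            best = &a;
            bestDist = d;
        }
    }
    if (best != nullptr) {
        return best->label;
    }
    return "OL";
}
```

For a ladder with rungs at 120.0, 124.0 and 128.0 bp labeled "9", "10" and "11", a sample peak at 124.2 bp would be assigned "10", while a peak at 126.0 bp falls outside every window and would be reported as "OL".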

Compared with the prior art, the invention has the following beneficial effects: it achieves fully automatic analysis of multicolor fluorescence channel data for short tandem repeats (STRs) in forensic DNA testing. It ends the reliance on foreign software for DNA sequencing analysis, lays a foundation for developing domestic sequencers and reagents in China, and accelerates localization in the field of forensic DNA testing in China.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 is a diagram of the ladder (rule standard) of the present invention.

FIG. 3 is a diagram of the typing standard of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention is implemented in C++ and performs fully automatic positional analysis of short tandem repeat (STR) sequencing data from DNA, which improves overall data processing efficiency, identifies the gene regions associated with specific traits, and provides a rich set of charts for displaying the positioning results. The algorithm supports analysis of multicolor fluorescence channels, supports the kits commonly used on the market, supports the raw data file formats output by common sequencers, automatically identifies and removes stray peaks generated during testing, and automatically calibrates the peak shifts that arise during sequencing.

Referring to FIGS. 1-3, the present invention provides a method for analyzing STR gene data, comprising the following steps:

Step 3: fluorescence color separation. The internal standard data in the data (provided by the reagent manufacturer and used as the reference against which real samples are analyzed) serves as the absolute positioning reference of the raw map, and the locus configuration file provided by the manufacturer serves as its logical reference; once the logical position of the internal standard is determined, each fluorescence channel is separated by color.

Step 4: peak calculation. Normal and abnormal peaks are distinguished automatically from each locus range and the high-to-low peak threshold ratio (the peak height ratio of a heterozygote), by checking whether the peak position lies within the locus range and whether the peak height ratio exceeds the threshold.

Step 7: stray-peak (mixed-peak) filtering. The attributes of a stray peak are preset: for homozygotes and heterozygotes, a peak whose height is below 50-60 and whose width is between 0.3 and 0.7 is judged to be a stray peak, and a peak outside that range is judged to be a valid peak.
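The peak checks of steps 4 and 7 can be sketched together as a small screening routine. The sketch reads step 7 as "low height and width inside the preset band means stray peak", takes 50 as a single height cutoff from the stated 50-60 range, and leaves the heterozygote ratio threshold to the caller. The Peak and StrayRule types, the function names and these parameter choices are assumptions for illustration; the text gives no units and no concrete ratio threshold.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// A detected peak (hypothetical type; units are not stated in the text).
struct Peak {
    double position;  // position on the sizing axis
    double height;    // peak height
    double width;     // peak width
};

// Preset stray-peak attributes; defaults follow the figures quoted in step 7.
struct StrayRule {
    double maxHeight = 50.0;  // assumed single cutoff within the 50-60 range
    double minWidth  = 0.3;
    double maxWidth  = 0.7;
};

// Step 7: a low peak whose width falls inside the preset band is stray.
bool isStrayPeak(const Peak& p, const StrayRule& r = StrayRule()) {
    return p.height < r.maxHeight && p.width >= r.minWidth && p.width <= r.maxWidth;
}

// Step 4, first check: does the peak lie inside the locus range [lo, hi]?
bool inLocusRange(const Peak& p, double lo, double hi) {
    assert(lo <= hi);
    return p.position >= lo && p.position <= hi;
}

// Step 4, second check: lower/higher peak height ratio of a heterozygote pair.
bool heterozygoteBalanced(const Peak& a, const Peak& b, double minRatio) {
    const double high = std::max(a.height, b.height);
    const double low  = std::min(a.height, b.height);
    if (high <= 0.0) {
        return false;  // no usable signal
    }
    return low / high >= minRatio;
}

// Keep peaks that are inside the locus range and not classified as stray.
std::vector<Peak> keepValidPeaks(const std::vector<Peak>& peaks,
                                 double lo, double hi,
                                 const StrayRule& r = StrayRule()) {
    std::vector<Peak> kept;
    for (const Peak& p : peaks) {
        if (inLocusRange(p, lo, hi) && !isStrayPeak(p, r)) {
            kept.push_back(p);
        }
    }
    return kept;
}
```

Passing the thresholds as parameters rather than hard-coding them reflects the text's point that the rules may need to be adjusted for different kits.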

Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for analyzing STR gene data is characterized by comprising the following steps:

2. The method of claim 1, wherein in step 3, the internal standard data is provided by a reagent manufacturer and is a reference for analyzing real samples with respect to the standard.