CN113284552B - Screening method and device for micro haplotypes - Google Patents

Info

Publication number
CN113284552B
Authority
CN
China
Prior art keywords
micro
data
haplotypes
sequence
single nucleotide
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110654476.7A
Other languages
Chinese (zh)
Other versions
CN113284552A (en)
Inventor
乌日嘎
刘志勇
孙宏钰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN202110654476.7A
Publication of CN113284552A
Application granted
Publication of CN113284552B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20: Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G16B20/30: Detection of binding sites or motifs

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a screening method and a screening device for micro-haplotypes. The method comprises the following steps: acquiring data to be screened, and reading the marker coordinates of multiple rows of single nucleotide polymorphism markers in the data to be screened; determining N primary micro-haplotypes according to the marker coordinates of the multiple rows of single nucleotide polymorphism markers; searching for a reference sequence corresponding to each primary micro-haplotype, and calculating the sequence characteristic parameters corresponding to each primary micro-haplotype by using each reference sequence; and screening the N primary micro-haplotypes according to the sequence characteristic parameters to obtain M target micro-haplotypes. The invention can screen micro-haplotypes from genome or transcriptome data accurately and rapidly, shortening screening time and greatly improving screening efficiency.

Description

Screening method and device for micro haplotypes
Technical Field
The invention relates to the technical field of forensic genetics, in particular to a screening method and device of micro haplotypes.
Background
In forensic genetics, the common genetic markers are short tandem repeats (STRs) and single nucleotide polymorphisms (SNPs). However, in addition to high allele mutation rates and unbalanced amplification, STR markers face a bottleneck that is hard to overcome: because of their sequence structure, replication slippage readily occurs during PCR, producing stutter peaks that interfere with data analysis, which is especially problematic when resolving forensic mixed stains. A single SNP marker, in turn, carries little genetic information, so a large number of markers often must be typed to reach a discrimination power comparable to that of STRs.
The microhaplotype (MH) combines the advantages of STRs and SNPs while avoiding their drawbacks, which makes it a very promising forensic genetic marker. It is defined as a multiallelic molecular marker consisting of 2-5 SNPs within a 200 bp range of the genome, and was originally proposed by Professor Kidd's laboratory at a university in the United States. Because MH research is still at an early stage, the field has no dedicated technical solution for searching for MHs; researchers usually search a data set manually, one locus at a time, or hire bioinformatics personnel to help write simple ad hoc scripts.
Manual searching is time-consuming, inaccurate, and prone to omissions, while searching with ad hoc scripts involves a heavy workload and high cost, so the technique is difficult to popularize and apply.
Disclosure of Invention
The invention provides a screening method and a screening device for micro haplotypes, which can rapidly, efficiently and accurately screen MH gene loci from gene data.
In a first aspect, an embodiment of the present invention provides a method for screening a micro-haplotype, where the method includes:
acquiring data to be screened, and reading mark coordinates of a plurality of lines of single nucleotide polymorphism marks in the data to be screened;
determining N primary micro-haplotypes according to the marking coordinates of the multi-line single nucleotide polymorphism marks, wherein N is a positive integer greater than or equal to 1;
searching a reference sequence corresponding to each primary micro-haplotype, and calculating sequence characteristic parameters corresponding to each primary micro-haplotype by using each reference sequence;
and screening the N primary micro-haplotypes according to the sequence characteristic parameters to obtain M target micro-haplotypes, wherein M is a positive integer greater than or equal to 1, and N is greater than or equal to M.
In a possible implementation manner of the first aspect, the determining N primary micro-haplotypes according to the marker coordinates of the multi-row single nucleotide polymorphism markers includes:
dividing the marking coordinates of the multi-row single nucleotide polymorphism marks into N groups of marking coordinate sets according to a preset reference coordinate difference value;
storing the single nucleotide polymorphism markers contained in each set of marker coordinate sets into a preset python dictionary respectively;
and respectively extracting the single nucleotide polymorphism markers contained in each group of mark coordinate sets from a preset python dictionary according to the preset storage quantity, and setting the single nucleotide polymorphism markers contained in each group of mark coordinate sets as a primary micro-haplotype to obtain N primary micro-haplotypes.
In a possible implementation manner of the first aspect, the searching for the reference sequence corresponding to each of the primary micro-haplotypes includes:
respectively preparing a sequence file according to the coordinate of the first single nucleotide polymorphism marker and the coordinate of the last single nucleotide polymorphism marker of each primary micro-haplotype;
and inputting the sequence file into a preset sequence searching tool, and searching to obtain a reference sequence corresponding to each primary micro-haplotype.
In a possible implementation manner of the first aspect, the sequence feature parameters include GC content values, repetitive sequence features, and genome-wide multiple matching indexes;
the calculating the sequence characteristic parameters corresponding to each primary micro-haplotype by using each reference sequence comprises the following steps:
searching a plurality of similar sequences from preset whole genome data by taking each reference sequence as a template through BLAST analysis, and calculating evaluation parameters of each similar sequence, wherein the evaluation parameters comprise expected values and score values;
counting the number of the similar sequences obtained by searching based on the expected value and the score value, and taking the number of the similar sequences as a whole genome multiple matching index;
respectively calculating the GC content value of each reference sequence;
and extracting the short tandem repeat sequence features from each reference sequence according to a preset repeat sequence feature value.
In a possible implementation manner of the first aspect, the data to be screened includes genome data and transcriptome data;
the reading of the mark coordinates of the multi-row single nucleotide polymorphism marks in the data to be screened comprises the following steps:
when the data to be screened is genome data, reading mark coordinates of a plurality of lines of single nucleotide polymorphism marks in the genome data;
when the data to be screened is transcriptome data, acquiring a start coordinate and a stop coordinate of a chromosome contained in the transcriptome data, taking a distance from the start coordinate to the stop coordinate as a coordinate interval, and screening mark coordinates of a plurality of target single nucleotide polymorphism marks with the coordinate values in the coordinate interval from the coordinate interval.
In a possible implementation manner of the first aspect, the screening from the N primary micro-haplotypes according to the sequence feature parameter to obtain M target micro-haplotypes includes:
judging whether the GC content value corresponding to each primary micro-haplotype meets the preset content value condition, judging whether the repeated sequence characteristic corresponding to each primary micro-haplotype meets the preset target sequence characteristic condition, and judging whether the genome-wide multi-matching index meets the preset index condition;
and obtaining, from the N primary micro-haplotypes, the M target micro-haplotypes whose GC content value meets the preset content value condition, whose repetitive sequence characteristic meets the preset target sequence characteristic condition, and whose genome-wide multiple matching index meets the preset index condition.
In a possible implementation manner of the first aspect, the method further includes:
acquiring typing data, wherein the typing data comprise single nucleotide polymorphism marker typing data of a plurality of populations;
splitting the typing data into a plurality of population typing data sets according to the preset population sources and sample names of the 1000 Genomes Project, wherein each population typing data set comprises the single nucleotide polymorphism marker typing data corresponding to each sample.
In a possible implementation manner of the first aspect, the method further includes:
and calculating the corresponding forensic parameters by using the target micro-haplotypes, wherein the forensic parameters comprise allele typing and allele frequency, observed heterozygosity, expected heterozygosity, matching probability, polymorphism information content, power of discrimination, probability of exclusion for trios, probability of exclusion for duos, and effective number of alleles.
In a possible implementation manner of the first aspect, after the step of determining the primary micro-haplotypes according to the marker coordinates of the multiple rows of single nucleotide polymorphism markers, the method further comprises:
naming each of the primary micro-haplotypes.
The second aspect of the embodiment of the present application further provides a screening apparatus for micro haplotypes, where the apparatus includes:
the reading module is used for acquiring data to be screened and reading mark coordinates of a plurality of lines of single nucleotide polymorphism marks in the data to be screened;
the determining module is used for determining N primary micro-haplotypes according to the marking coordinates of the multi-line single nucleotide polymorphism marks, wherein N is a positive integer greater than or equal to 1;
the calculation module is used for searching the reference sequence corresponding to each primary micro-haplotype and calculating the sequence characteristic parameter corresponding to each primary micro-haplotype by using each reference sequence;
a screening module for screening M target micro-haplotypes from the N primary micro-haplotypes according to the sequence characteristic parameters, wherein M is a positive integer greater than or equal to 1, and N is greater than or equal to M.
Compared with the prior art, the screening method and device for micro-haplotypes provided by the embodiments of the application have the following beneficial effects. The application reads the marker coordinates of the single nucleotide polymorphism markers and performs a rough screening based on those coordinates to obtain the primary micro-haplotypes; it then searches for the reference sequence of each primary micro-haplotype, calculates the sequence characteristic parameters from the reference sequence, and finally screens out the target micro-haplotypes according to those parameters, thereby achieving rapid screening of micro-haplotypes. The whole process is simple and quick, which shortens screening time and improves both screening efficiency and screening accuracy. The application covers the whole process of screening and evaluating MHs starting from raw genome and transcriptome data and forms a complete technical solution that integrates and improves on existing approaches, greatly improving the practicability and flexibility of screening. The application also provides a unified naming scheme for genome-derived and transcriptome-derived MH loci and their corresponding alleles, which facilitates information exchange between laboratories and rapid data processing by computer.
Drawings
FIG. 1 is a flow chart of a method for screening micro haplotypes according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for screening micro-haplotypes according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a screening device for micro haplotypes according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
At present the field has no dedicated technical solution for searching for micro-haplotypes; researchers usually search a data set manually, one locus at a time, or hire bioinformatics personnel to help write simple ad hoc scripts. Manual searching is time-consuming, inaccurate, and prone to omissions, and because personnel outside the field have only a limited command of forensic genetics, the scripts they write are of low accuracy and high cost, so the technique is difficult to popularize and apply.
In order to solve the above problems, a method for screening micro haplotypes according to the present application will be described and illustrated in detail by the following specific examples.
Referring to fig. 1, a flow chart of a screening method of micro haplotypes according to an embodiment of the present application is shown.
Wherein, as an example, the screening method of the micro-haplotype can comprise the following steps:
S11, acquiring data to be screened, and reading the marker coordinates of multiple rows of single nucleotide polymorphism markers in the data to be screened.
The data to be screened is an annotation file containing SNPs that the user prepares in advance, and may be a file in a VCF-like format. The file may include: (1) CHROM: the chromosome; (2) POS: the genomic position; (3) ID: the rsID of the variant site, or "." if there is none; (4) REF: the reference typing; (5) ALT: the variant typing; (6) QUAL: the quality of the variant call at this site; (7) FILTER: the filter status of the variant site, PASS if the site passed filtering and otherwise the filter that it failed; (8) INFO: detailed information about the variant. Specifically, the user may download the SNP-VCF files of the 1000 Genomes Project data in advance.
To improve screening efficiency, in actual operation an empty worksheet file can first be created in Python to hold the data read from the SNP-VCF files; next, Python modules such as pandas, os, xlwt, xlsxwriter, openpyxl, and csv are imported; then a for loop reads the SNP-VCF files of the different chromosomes one by one, and a second for loop reads the SNP-VCF file of each chromosome row by row to obtain the coordinates of the multiple rows of single nucleotide polymorphism markers.
In this embodiment, the data to be screened may include genomic data and transcriptome data, wherein step S11 may include the following sub-steps, as an example:
Sub-step S111, when the data to be screened is genome data, reading the marker coordinates of the multiple rows of single nucleotide polymorphism markers in the genome data.
In this embodiment, the marker coordinate is the difference between the coordinate of the single nucleotide polymorphism marker and the starting coordinate.
In a specific implementation, for convenience of screening, the user may use Python to set up the SNP-VCF file corresponding to each chromosome, then read the SNP-VCF file of each chromosome line by line with a for loop and split each line with the split function to obtain the chromosome number and information such as the name and position of each single nucleotide polymorphism marker (SNP), thereby obtaining a plurality of SNPs.
In this embodiment, the search and read may proceed as follows: the initial SNP position coordinate is set to "loc=0"; "loc" is subtracted from the position coordinate of the SNP in the first row to obtain the marker coordinate of the first row, then "loc" is subtracted from the position coordinate of the SNP in the second row to obtain the marker coordinate of the second row, and so on, until the marker coordinate of the SNP in every row has been obtained.
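The row-by-row read described above can be sketched as follows. This is a minimal illustration rather than the patent's actual script: the function name `read_marker_coordinates`, the tab-separated layout, and the placeholder IDs `rs_a` and `rs_b` are assumptions introduced here, and only the first five VCF-like columns are used.

```python
def read_marker_coordinates(lines, loc=0):
    """Return (chrom, snp_id, pos - loc, ref, alt) for each data row.

    `loc` is the initial SNP position coordinate ("loc=0" in the text).
    """
    markers = []
    for line in lines:
        # Skip blank lines and header lines, which start with "#".
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, snp_id, ref, alt = fields[:5]
        markers.append((chrom, snp_id, int(pos) - loc, ref, alt))
    return markers


# Hypothetical rows in a VCF-like layout (IDs are placeholders).
example_lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "21\t15765902\trs_a\tA\tC\t.\tPASS\t.",
    "21\t15765915\trs_b\tG\tA,T\t.\tPASS\t.",
]
markers = read_marker_coordinates(example_lines)
```

With `loc=0` the marker coordinate equals the genomic position, so later steps can compare rows directly.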
Sub-step S112, when the data to be screened is transcriptome data, acquiring a start coordinate and an end coordinate of a chromosome contained in the transcriptome data, and taking the range from the start coordinate to the end coordinate as a coordinate interval.
Sub-step S113, screening out the marker coordinates of the plurality of target single nucleotide polymorphism markers whose coordinate values lie within the coordinate interval.
In this embodiment, the transcriptome data is data that the user collates in advance. The transcriptome data may be a file in a BED-like format containing six columns: (1) Chrom: the chromosome number; (2) ChromStart: the chromosome start coordinate; (3) ChromEnd: the chromosome end coordinate; (4) Name: the line name; (5) Score: 0-1000, the gray value displayed in the genome browser; (6) Strand: the plus or minus strand label.
In a specific implementation, the "ChromStart" and "ChromEnd" coordinates may be read to obtain the start and end marker coordinates.
In practical operation, the transcriptome data and the genome data can be screened and read together: a for loop combined with an if condition traverses the SNP-VCF files of sub-step S111, and the cSNP genetic markers originating from the transcriptome data are screened out.
The specific search process is as follows: after the BED file is read row by row with a for loop, the genomic start and end coordinates (i.e., "ChromStart" and "ChromEnd") of a given transcriptome genetic marker are obtained, and these two coordinates determine a coordinate interval, the "ChromStart-ChromEnd" interval.
After the interval is determined, the coordinates of the single nucleotide polymorphism markers of each row may be calculated using the coordinate calculation method of sub-step S111. The calculation is as described above and, to avoid repetition, is not repeated here.
In this embodiment, it may be determined whether the coordinate of a genetic marker on the corresponding chromosome lies within the "ChromStart-ChromEnd" interval; if it does, the single nucleotide polymorphism marker (SNP) is determined to be a cSNP.
To facilitate subsequent operations, the preset SNP-VCF file and the BED file of the transcriptome data can be joined on the cSNPs that were found, then sorted and output as a result file in VCF format; the result file may be a VCF-like file containing the transcriptome cSNPs.
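The interval test for cSNPs might look like the sketch below. It is an illustration under assumptions, not the patent's code: the helper names, the tuple layout, the interval bounds, and the SNP IDs are hypothetical, and the interval is treated as closed at both ends because the text does not say how the boundaries are handled.

```python
def parse_bed_line(line):
    """Split one six-column BED-like row; keep the fields used for screening."""
    chrom, start, end, name, score, strand = line.rstrip("\n").split("\t")[:6]
    return chrom, int(start), int(end), name


def select_csnps(snps, bed_lines):
    """Return the SNPs whose position lies in a ChromStart-ChromEnd interval.

    snps: list of (chrom, pos, snp_id) tuples.
    Each hit is (chrom, pos, snp_id, transcript_name).
    """
    hits = []
    for bed_line in bed_lines:
        chrom, start, end, name = parse_bed_line(bed_line)
        for snp_chrom, pos, snp_id in snps:
            # Assumption: both interval ends are inclusive.
            if snp_chrom == chrom and start <= pos <= end:
                hits.append((snp_chrom, pos, snp_id, name))
    return hits


# Hypothetical interval and SNP IDs, for illustration only.
bed_lines = ["21\t15024000\t15024300\tH1_circ_007207\t0\t+"]
snps = [
    ("21", 15024188, "snp_a"),
    ("21", 15024224, "snp_b"),
    ("21", 15765902, "snp_c"),
]
csnps = select_csnps(snps, bed_lines)
```

The nested loops mirror the for-plus-if traversal in the text; for large files an interval index would be faster, but that is outside what the source describes.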
S12, determining N primary micro-haplotypes according to the marker coordinates of the multiple rows of single nucleotide polymorphism markers, wherein N is a positive integer greater than or equal to 1.
Because the single nucleotide polymorphism marker coordinates obtained span many rows, a rough screening can first be performed on those coordinates to determine a certain number of primary micro-haplotypes, and the target micro-haplotypes are then screened from the primary micro-haplotypes, which improves screening accuracy.
To increase the efficiency of the screening, step S12 may include the following sub-steps, as an example:
Sub-step S121, dividing the marker coordinates of the multiple rows of single nucleotide polymorphism markers into N marker coordinate sets according to the preset reference coordinate difference.
In this embodiment, the marker coordinates may be divided into one marker coordinate set every 100 rows; this may be adjusted to every 50, 200, or 500 rows according to actual needs.
Sub-step S122, storing the single nucleotide polymorphism markers contained in each marker coordinate set into a preset Python dictionary.
For ease of computation, in this embodiment the rows may be traversed to determine whether the coordinate of the single nucleotide polymorphism marker in each row belongs to a given marker coordinate set; when a marker's coordinate belongs to the current set, the marker may be stored in the preset Python dictionary.
For example, suppose the preset reference coordinate difference is 200. The difference between the coordinate of the SNP marker in row 2 and that of the SNP marker in row 1 is calculated as 50; since 50 is smaller than 200, rows 1 and 2 are placed in the same group, and their SNP markers are stored in the same group of the preset Python dictionary. Next, the difference between the coordinate of the SNP marker in row 3 and that of row 1 is calculated as 120; since 120 is also smaller than 200, the SNP marker of row 3 is stored in the same group. Finally, the difference between the coordinate of the SNP marker in row 4 and that of row 1 is found to be greater than 200, so row 4 does not belong to this group.
The SNP marker of row 4 is then taken as the starting marker of the second marker coordinate set (micro-haplotype): the difference between the coordinates of row 5 and row 4 is calculated and compared with the preset reference coordinate difference, and so on.
It should be noted that, in actual operation, the preset reference coordinate difference may be 50, 80, 300, 550, n, etc., and may be specifically adjusted according to actual needs.
In a specific implementation, the preset Python dictionary consists of "keys" and "values": a key may store the name of a micro-haplotype locus, and the corresponding value may store the attributes of that locus, such as the combination of the position coordinates of its single nucleotide polymorphism markers. Keys and values correspond one to one.
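The grouping rule and the dictionary layout can be sketched together as below. This is a minimal sketch, assuming positions arrive sorted in ascending order on one chromosome; the function name, the `group1`, `group2` keys, and the example positions are hypothetical. A difference exactly equal to the reference value is kept in the group here, which is an assumption, since the text only covers the "smaller" and "greater" cases.

```python
def group_by_anchor(positions, max_span=200):
    """Group ascending SNP positions by their distance to the group's first SNP.

    Returns a dict mapping a group key to the list of positions in that group.
    """
    groups = {}
    anchor = None  # coordinate of the first SNP in the current group
    key = None
    for pos in positions:
        # Start a new group when the distance to the anchor exceeds max_span.
        if anchor is None or pos - anchor > max_span:
            anchor = pos
            key = f"group{len(groups) + 1}"
            groups[key] = []
        groups[key].append(pos)
    return groups


# Rows 2 and 3 lie 50 and 120 from row 1; row 4 lies 260 away and opens group 2.
positions = [1000, 1050, 1120, 1260, 1300]
groups = group_by_anchor(positions)
print(groups)  # {'group1': [1000, 1050, 1120], 'group2': [1260, 1300]}
```

Anchoring every comparison on the first SNP of the group, rather than on the previous row, keeps each group inside a single window of the reference width.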
Sub-step S123, extracting from the preset Python dictionary, according to the preset storage quantity, the single nucleotide polymorphism markers contained in each marker coordinate set, and determining the single nucleotide polymorphism markers contained in each marker coordinate set as one primary micro-haplotype, to obtain N primary micro-haplotypes.
In actual operation, the storage quantity of each marker coordinate set can be checked in turn: if the number of single nucleotide polymorphism markers stored in a set meets the minimum requirement set by the user, those markers are extracted from the set to give a primary micro-haplotype, and this is repeated until every marker coordinate set has been checked.
For example, an if statement may be used to check the number of single nucleotide polymorphism markers (SNPs) stored in the preset Python dictionary; if the number meets the minimum requirement of the preset storage quantity (e.g., 3), a micro-haplotype (MH) is considered to have been found, and that set of single nucleotide polymorphism markers is a primary micro-haplotype.
In this example, micro-haplotypes (MHs) can be screened efficiently by combining a for loop with if conditions under the criterion of at least 2 SNPs within a 200 bp range.
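The count check can be written as a short filter over the dictionary. The sketch below is illustrative only: the function name and the example groups are hypothetical, and the threshold is applied as "at least `min_snps`", following the "greater than or equal to 2 SNPs" criterion; a user who wants the stricter example value would pass `min_snps=3`.

```python
def select_primary_mh(groups, min_snps=2):
    """Keep only the groups that hold at least `min_snps` SNP positions."""
    primary = {}
    for name, snp_positions in groups.items():
        if len(snp_positions) >= min_snps:
            primary[name] = snp_positions
    return primary


# Hypothetical groups produced by an earlier grouping step.
groups = {"g1": [1000, 1050, 1120], "g2": [1500], "g3": [2000, 2150]}
primary = select_primary_mh(groups)
strict = select_primary_mh(groups, min_snps=3)
```

Here `primary` keeps `g1` and `g3`, while `strict` keeps only `g1`.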
In practice, in order to be able to subsequently distinguish and sort each micro-haplotype (MH), the method may further comprise, as an example, after step S12:
Step S21, naming each primary micro-haplotype.
Specifically, the primary micro-haplotypes obtained by screening can be collated, and the chromosome number, the user-defined name of each micro-haplotype (MH), and information such as the SNPs and position coordinates contained in the MH can be written into the preset worksheet file, so that every micro-haplotype is labeled.
In subsequent management, the required micro-haplotype (MH) can then be found by its name or number.
For naming, the invention provides a new naming scheme for MH loci that follows the habits of forensic research and practice while also taking into account the convenience of large-scale computer processing.
For example, a micro-haplotype (MH) from a genomic source may be named "MH21SHY5/15765902AC/15765915GAT/15766020AG/15766086CA". Here "MH" is the abbreviation of microhaplotype; the number "21" after "MH" is the chromosome on which the locus lies (this part takes the values 01 to 22, X, Y, or MT); the capital letters "SHY" abbreviate the name of the laboratory that discovered the marker; and the following number "5" is the serial number of the locus, indicating that it is the 5th micro-haplotype found on chromosome 21 by this laboratory.
These are followed, in ascending order, by the position coordinates of the SNPs contained in the MH; each coordinate is followed by the SNP's reference typing (first letter) and variant typing (remaining letters). The name can thus be regarded as two parts separated by the first "/". The part before the first "/" is the short name of the marker; for example, the marker may simply be called "mh21SHY5", which eases spoken or written communication among forensic practitioners. The remainder completes the full name of the marker, which facilitates comparing the marker between laboratories and displaying and processing it by computer.
This MH naming scheme is a moderate compromise: it supports basic research on this type of marker by professionals, and it also suits large-scale forensic application (including building large MH marker databases) once research on the markers has matured.
In addition, like genome-derived MH loci, transcriptome-derived MHs lack a standard naming convention, and no recommended naming scheme has yet been published in China or abroad. The application therefore also proposes a naming scheme linked to that of genome-derived MHs, for example "MH21SHY1/H1_circ_007207/15024188CT/15024224GAC/15024284GC". Compared with the genome-derived naming above, this scheme inserts "H1_circ_007207" after the serial number and before the first SNP position coordinate; it identifies the transcriptome genetic marker from which the MH originates (in this example, a circRNA molecule) and distinguishes the MH from genome-derived ones. All other parts have the same meaning as in the genome-derived naming.
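The two naming patterns can be assembled mechanically from their parts. The following is a sketch, not the patent's implementation: the function name and the `(position, ref, alts)` tuple layout are assumptions, and the "MH" prefix follows the casing of the full-name examples in the text.

```python
def mh_locus_name(chrom, lab, serial, snps, source=None, prefix="MH"):
    """Build a locus name such as MH21SHY5/15765902AC/...

    chrom:  chromosome label as text ("01".."22", "X", "Y", "MT")
    lab:    laboratory abbreviation, e.g. "SHY"
    serial: serial number of the locus on that chromosome
    snps:   list of (position, ref, alts) with alts a list of variant bases
    source: optional transcript identifier for transcriptome-derived loci
    """
    parts = [f"{prefix}{chrom}{lab}{serial}"]
    if source is not None:
        parts.append(source)
    # SNPs are written in ascending order of position coordinate.
    for pos, ref, alts in sorted(snps, key=lambda snp: snp[0]):
        parts.append(f"{pos}{ref}{''.join(alts)}")
    return "/".join(parts)


genomic = mh_locus_name(
    "21", "SHY", 5,
    [(15765902, "A", ["C"]), (15765915, "G", ["A", "T"]),
     (15766020, "A", ["G"]), (15766086, "C", ["A"])],
)
transcriptomic = mh_locus_name(
    "21", "SHY", 1,
    [(15024188, "C", ["T"]), (15024224, "G", ["A", "C"]), (15024284, "G", ["C"])],
    source="H1_circ_007207",
)
short_name = genomic.split("/", 1)[0]
print(genomic)     # MH21SHY5/15765902AC/15765915GAT/15766020AG/15766086CA
print(short_name)  # MH21SHY5
```

Splitting on the first "/" recovers the short name, matching the two-part structure described in the text.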
Similarly, the alleles of MH loci of genomic and transcriptomic origin also need to be named. Optionally, the typing of their SNPs may be used for naming. For example, for the genome-derived MH21SHY5/15765902AC/15765915GAT/15766020AG/15766086CA described above, an allele may be written directly as the SNP typings in the ascending order of SNP coordinates given in the MH locus name, e.g. "ATGC", where "A" is the typing at 15765902, "T" is the typing at 15765915, and so on.
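The naming rules above can be illustrated with a minimal Python sketch that splits a locus name into its fields and checks an allele string against it. The function names `parse_mh_name` and `is_valid_allele` and the returned dictionary layout are illustrative assumptions, not part of the application; any field that is not a coordinate followed by base letters is assumed to be the transcript source tag.

```python
import re

def parse_mh_name(name):
    """Split an MH locus name into brief name, optional source tag and SNP fields."""
    brief, *fields = name.split("/")
    m = re.fullmatch(r"(?i:mh)(\d{2}|X|Y|MT)([A-Za-z]+)(\d+)", brief)
    if m is None:
        raise ValueError(f"unrecognised brief name: {brief!r}")
    source, snps = None, []
    for field in fields:
        s = re.fullmatch(r"(\d+)([ACGT])([ACGT]+)", field)
        if s is None:
            # Assumption: a non-SNP field is the transcript source tag.
            source = field
        else:
            snps.append({"pos": int(s.group(1)),
                         "ref": s.group(2),
                         "alts": list(s.group(3))})
    return {"brief": brief, "chrom": m.group(1), "lab": m.group(2),
            "serial": int(m.group(3)), "source": source, "snps": snps}

def is_valid_allele(locus, allele):
    """True if the allele string has one permitted base per SNP, in coordinate order."""
    snps = locus["snps"]
    return len(allele) == len(snps) and all(
        base == snp["ref"] or base in snp["alts"]
        for base, snp in zip(allele, snps))

locus = parse_mh_name("MH21SHY5/15765902AC/15765915GAT/15766020AG/15766086CA")
print(locus["chrom"], locus["lab"], locus["serial"], is_valid_allele(locus, "ATGC"))
```

Keeping the SNP fields as a list in coordinate order means the allele string can be checked position by position, which mirrors the rule that alleles are written in ascending order of SNP coordinates.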
The MH locus naming method based on genome and transcriptome sources and the corresponding allele naming mode provided by the invention have simple rules and are easy to master, so that the method is beneficial to communication popularization among different laboratories and is beneficial to computer batch information processing.
S13, respectively searching a reference sequence corresponding to each primary micro-haplotype, and respectively calculating sequence characteristic parameters corresponding to each primary micro-haplotype by using each reference sequence.
The reference sequence may be a fasta format sequence of a micro-haplotype (MH). The sequence characteristic parameter may be a scoring value for a subsequent forensic reference calculation of the micro-haplotype (MH).
In this embodiment, the reference sequence corresponding to each primary micro-haplotype is searched for using the bedtools software, with the GRCh38_human_ref file as the template sequence; GRCh38_human_ref is a standard reference file used for human genome alignment.
Specifically, in order to be able to increase the accuracy of the search, step S13 may include the following sub-steps, as an example:
Substep S131: preparing a sequence file for each primary micro-haplotype according to the coordinates of its first single nucleotide polymorphism marker and of its last single nucleotide polymorphism marker.
Specifically, the position coordinates of the first SNP row and of the last SNP row of each primary micro-haplotype can be extracted.
In this embodiment, the two rows of coordinates may be spliced, where the splicing result includes: "CHROM, start position of cMH, end position of cMH, cmh_name".
Substep S132: inputting the sequence file into a preset sequence searching tool, and searching to obtain the reference sequence corresponding to each primary micro-haplotype.
The sequence file is input into the bedtools software, which retrieves the fasta-format sequence corresponding to each micro-haplotype (MH) from the whole human genome (GRCh38_human_ref).
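Substeps S131 and S132 can be sketched as follows. This is a minimal illustration under stated assumptions: SNP coordinates are taken to be 1-based (VCF style), so the BED start is shifted by one; the helper names, the `chr21` label and the file names are hypothetical; and the command list is only assembled, not executed. The options used (`-fi`, `-bed`, `-name`, `-fo`) are those of `bedtools getfasta`.

```python
def bed_line(chrom, first_snp_pos, last_snp_pos, name):
    """One tab-separated BED record spanning the first to the last SNP of a candidate MH."""
    # Assumption: positions are 1-based and inclusive; BED start is 0-based, end is exclusive.
    return f"{chrom}\t{first_snp_pos - 1}\t{last_snp_pos}\t{name}"

def getfasta_command(reference_fasta, bed_path, out_path):
    """Argument list for extracting the fasta sequences of all BED records."""
    return ["bedtools", "getfasta",
            "-fi", reference_fasta,   # reference genome in fasta format
            "-bed", bed_path,         # records produced by bed_line()
            "-name",                  # label each sequence with the BED name column
            "-fo", out_path]          # output fasta file

record = bed_line("chr21", 15765902, 15766086, "mh21SHY5")
command = getfasta_command("GRCh38_human_ref.fa", "cmh.bed", "cmh.fa")
print(record)
print(" ".join(command))
```

The argument list could be passed to `subprocess.run` once the BED file has been written; whether the chromosome label needs a `chr` prefix depends on the reference file in use.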
The sequence of the micro-haplotype (MH), for example in fasta format, itself affects subsequent specific primer design and amplification, and thus subsequent forensic applications. In order to determine whether the found MH sequence has uniqueness in the genome and to evaluate the accuracy of the found and calculated similar sequence, in this embodiment, the sequence feature parameters include GC content values, repetitive sequence features and genome-wide multiple matching indexes, where, as an example, step S13 may further include the following sub-steps:
Substep S134: using each reference sequence as a template, searching preset whole genome data for a plurality of similar sequences by BLAST analysis, and calculating evaluation parameters of each similar sequence, wherein the evaluation parameters comprise an expected value and a score value.
Specifically, the blastn algorithm of the BLAST software can be used to find similar sequences in the genome.
The blastn algorithm can find multiple sequences within the whole human genome (GRCh38_human_ref) that are similar to the fasta-format sequence of a micro-haplotype (MH).
In practice, the E-value threshold for the fasta-format sequence of each primary micro-haplotype (MH) can be set to 0.00001 in the blastn algorithm; similar sequences are then searched for, and a plurality of sequences similar to the reference sequence are selected from the results.
For example, for some reference sequences, 10 to 20 or more similar sequences may be found in the whole human genome.
Simultaneously with the search, an evaluation parameter for each searched similar sequence may be calculated, which may specifically comprise an expected value (E) and a score value (S).
Wherein the score value is a similarity rating of the similar sequence to the reference sequence, and a higher score indicates a greater degree of similarity between the reference sequence and the similar sequence. Specifically, a blastn algorithm can be adopted to find the score information of each similar sequence, so as to obtain the score value of each similar sequence.
The expected value is a reliability evaluation of the score value: it is the number of alignments scoring at least as high as the similar sequence that would be expected to occur by chance in a database of the same size, so the lower the value, the better. Specifically, the blastn algorithm can be used to obtain the expected-value information of each similar sequence, giving the expected value corresponding to each similar sequence.
Substep S135: counting the number of similar sequences obtained by the search based on the expected value and the score value, and taking the number of similar sequences as the whole genome multi-matching index.
The whole genome multi-matching index is the number of similar sequences whose expected value meets a preset expected value requirement and whose score value meets a preset score value requirement. Because there are many similar sequences and their evaluation parameters vary, using every similar sequence would entail a heavy workload and would reduce screening precision and efficiency.
To reduce the effort and improve the screening accuracy, the number of similar sequences required may be determined based on the expected value and the score value.
Specifically, the plurality of similar sequences may be ordered according to the expected value and the score value, and then a certain number of similar sequences may be screened. For example, 10 similar sequences are sorted according to the scores of the expected value and the score value from high to low, then the first 5 similar sequences are extracted, the first 5 similar sequences are used as target similar sequences, and the first 5 similar sequences are output and stored in an XML format.
The greater the number of multi-matching sequences across the whole genome that satisfy the specified score and expected value, the higher the likelihood of non-specific amplification during subsequent detection, and the more likely the analysis of the results will be disturbed.
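The counting in substep S135 can be sketched in Python as below. The sketch assumes the BLAST results were written in the 12-column tabular format (query ID first, E value in column 11, bit score in column 12) rather than the XML format mentioned above; the function name and the default thresholds are hypothetical.

```python
def multi_match_index(tabular_lines, max_evalue=1e-5, min_bitscore=0.0):
    """Count, per query sequence, the hits that pass both thresholds.

    tabular_lines: iterable of tab-separated BLAST tabular records.
    Note: the hit of a sequence against its own genomic location is counted too.
    """
    counts = {}
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.rstrip("\n").split("\t")
        query = cols[0]
        evalue = float(cols[10])    # expected value (E)
        bitscore = float(cols[11])  # score value (S)
        if evalue <= max_evalue and bitscore >= min_bitscore:
            counts[query] = counts.get(query, 0) + 1
    return counts
```

A candidate whose count is 1 under this convention matches only its own locus; larger counts indicate additional similar regions and therefore a higher risk of non-specific amplification.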
Substep S136: calculating the GC content value of each reference sequence.
The GC content value refers to the proportion of guanine and cytosine among the 4 bases of DNA. The GC content value of the reference sequence of each micro-haplotype (MH) is calculated because GC content has a large influence on the PCR process.
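A minimal Python sketch of the GC content calculation follows. The function name is hypothetical, and ignoring characters other than A, C, G and T (such as N) is an assumption of this sketch rather than something stated in the text.

```python
def gc_content(seq):
    """Percentage of G and C among the A/C/G/T bases of a sequence."""
    bases = [b for b in seq.upper() if b in "ACGT"]  # ignore N and other symbols
    if not bases:
        return 0.0
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)

print(gc_content("ATGCGC"))
```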
Substep S137: extracting short tandem repeat sequence features from each reference sequence according to a preset repeat sequence feature value.
Specifically, it can be determined whether the reference sequence contains a sequence resembling a short tandem repeat (STR).
For example, the motif may be set to contain 1 to 6 bases with 4 or more repetitions, and the short tandem repeats (STRs) meeting this setting are then extracted from the reference sequence.
Finally, the short tandem repeat features and the GC content value can be collated and output as a new file, which includes: mh_name, GC%, number_number, repeat_number.
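The STR search in substep S137 can be sketched with a regular expression. The example uses the setting given above (motif of 1 to 6 bases, at least 4 repetitions); the function name and the output tuple layout are illustrative assumptions.

```python
import re

# A motif of 1-6 bases (shortest first) followed by at least 3 more copies of itself.
STR_PATTERN = re.compile(r"([ACGT]{1,6}?)\1{3,}")

def find_strs(seq):
    """Return (motif, repeat_count, start_index) for each tandem repeat found."""
    hits = []
    for m in STR_PATTERN.finditer(seq.upper()):
        motif = m.group(1)
        hits.append((motif, len(m.group(0)) // len(motif), m.start()))
    return hits

print(find_strs("CCG" + "AGAT" * 5 + "CC"))
```

The lazy quantifier makes the pattern prefer the shortest motif, so a homopolymer run is reported with a 1-base motif rather than as a repeat of a longer unit.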
S14, screening the N primary micro-haplotypes according to the sequence characteristic parameters to obtain M target micro-haplotypes, wherein M is a positive integer greater than or equal to 1, and N is greater than or equal to M.
After the sequence characteristic value and the GC content value of each primary micro-haplotype (MH) are obtained, each micro-haplotype (MH) can be judged against them: if the sequence characteristic value of a micro-haplotype (MH) meets a preset characteristic threshold or its GC content value meets a preset content threshold, the primary micro-haplotype is determined to be a target micro-haplotype.
Since the sequence characteristic parameters include GC content values, repetitive sequence characteristics and genome wide multiple matching indices, step S14 may include the sub-steps of:
Substep S141: judging whether the GC content value corresponding to each primary micro-haplotype meets the preset content value condition, judging whether the repetitive sequence feature corresponding to each primary micro-haplotype meets the preset target sequence feature condition, and judging whether the genome-wide multi-match index meets the preset index condition.
Substep S142: screening out, from the N primary micro-haplotypes, the M primary micro-haplotypes whose GC content value meets the preset content value condition, whose repetitive sequence feature meets the preset target sequence feature condition, and whose genome-wide multi-match index meets the preset index condition, to obtain M target micro-haplotypes.
Specifically, the GC content value, the repetitive sequence feature, and the genome-wide multi-match index of each primary micro-haplotype may be compared with the corresponding preset content value, preset comparison sequence feature, and preset quantity value, respectively. When the GC content value satisfies the preset content value, the repetitive sequence feature satisfies the preset comparison sequence feature, and the genome-wide multi-match index satisfies the preset quantity value, the primary micro-haplotype is determined to be a target micro-haplotype.
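Substeps S141 and S142 amount to keeping the candidates that pass all three conditions. The sketch below is illustrative only: the record keys, the function name and every threshold value are hypothetical, since the text does not state the preset values.

```python
def screen_candidates(candidates, gc_range=(40.0, 60.0),
                      max_str_count=0, max_multi_match=1):
    """Return the names of candidates that satisfy all three conditions.

    candidates: list of dicts with keys 'name', 'gc', 'str_count', 'multi_match'.
    All thresholds are placeholder values chosen for illustration.
    """
    low, high = gc_range
    selected = []
    for c in candidates:
        gc_ok = low <= c["gc"] <= high
        str_ok = c["str_count"] <= max_str_count
        match_ok = c["multi_match"] <= max_multi_match
        if gc_ok and str_ok and match_ok:
            selected.append(c["name"])
    return selected
```

Because the three conditions are combined with a logical AND, the number M of selected candidates can never exceed the number N of candidates supplied.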
To facilitate the subsequent forensic parameter calculation, the method may further comprise, as an example:
S15, acquiring typing data, wherein the typing data comprise single nucleotide polymorphism marker typing data of a plurality of populations.
S16, splitting the typing data into a plurality of population typing data sets according to the preset 1000 Genomes Project population sources and sample names, wherein each population typing data set comprises the single nucleotide polymorphism marker typing data corresponding to each sample.
Specifically, the typing data are human genome data covering a certain number of individuals, collected or downloaded in advance by the user.
The typing data are split into a plurality of population typing data sets according to the preset 1000 Genomes Project population sources and sample names.
For example, if the typing data consist of 2504 individuals in total from 26 population sources, the typing data can be split into 26 population typing data sets according to the population source.
In a specific implementation, the user can prepare in advance a TXT file containing the sample names and the corresponding population source information, then open the original individual typing data of the 1000 Genomes Project chromosome by chromosome, use a for loop to match and search the records one by one, assign each record to its sample and population source according to the sample names recorded in the TXT file, and finally split the typing data into 26 population typing data sets.
In addition, because each population typing data set comprises the single nucleotide polymorphism marker typing data corresponding to each sample, each population typing data set can be saved as a file in CSV format to facilitate subsequent calculation with the population typing data and the target micro-haplotypes obtained from the genome or the transcriptome.
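The splitting described in S15 and S16 can be sketched as follows. The sketch assumes a whitespace-separated TXT panel with a sample name and a population code per line (an optional header line starting with `sample` is skipped) and genotype calls held in a dictionary keyed by sample name; all function names and the data layout are hypothetical.

```python
import csv
import io
from collections import defaultdict

def read_panel(lines):
    """Map each sample name to its population code."""
    panel = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2 and parts[0] != "sample":
            panel[parts[0]] = parts[1]
    return panel

def split_by_population(genotypes, panel):
    """Group {sample: [calls]} into {population: {sample: [calls]}}."""
    groups = defaultdict(dict)
    for sample, calls in genotypes.items():
        population = panel.get(sample)
        if population is not None:  # samples missing from the panel are skipped
            groups[population][sample] = calls
    return dict(groups)

def to_csv(group):
    """Serialise one population group as CSV text, one sample per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for sample, calls in sorted(group.items()):
        writer.writerow([sample, *calls])
    return buffer.getvalue()
```

In practice the CSV text of each group would be written to its own file, giving one file per population source.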
To calculate common forensic parameters for each population of each genome, the method may further comprise, as an example:
S17, calculating corresponding forensic parameters using the target micro-haplotypes, wherein the forensic parameters comprise allele typing and its frequency, observed heterozygosity, expected heterozygosity, matching probability, polymorphism information content, individual identification probability, trio paternity exclusion probability, duo paternity exclusion probability, and effective number of alleles.
Specifically, the single nucleotide polymorphism marker (SNP) typing data of each population may be combined according to the SNPs contained in each target micro-haplotype (MH) obtained by the above screening. Using the CSV files obtained by splitting the 1000 Genomes Project data, the genotype information of all SNPs of each individual contained in a CSV file can be combined into the genotype of the target MH (because each MH consists of several adjacent SNPs), giving the corresponding MH genotype of each individual. To facilitate subsequent calculation and viewing, the SNPs contained in the MH are converted into the standard "ATCG" base format (the original SNP genotypes are coded with numbers).
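Combining numeric SNP genotypes into MH alleles can be sketched as below. The sketch assumes phased, VCF-style calls such as `0|1`, where 0 denotes the reference base and 1, 2, ... index the variant bases in the order given in the locus name; the function name is hypothetical and the example calls are made up for illustration.

```python
def mh_alleles(snps, calls):
    """Build the two MH alleles of one individual from phased SNP calls.

    snps:  list of (ref, alts) in ascending coordinate order, e.g. ("G", "AT").
    calls: list of phased genotype strings, one per SNP, e.g. "2|0".
    """
    first, second = [], []
    for (ref, alts), call in zip(snps, calls):
        bases = [ref] + list(alts)      # index 0 = reference, 1.. = variants
        a, b = call.split("|")
        first.append(bases[int(a)])
        second.append(bases[int(b)])
    return "".join(first), "".join(second)

# SNPs of MH21SHY5/15765902AC/15765915GAT/15766020AG/15766086CA
snps = [("A", "C"), ("G", "AT"), ("A", "G"), ("C", "A")]
print(mh_alleles(snps, ["0|1", "2|0", "1|0", "0|1"]))
```

Phasing matters here: with unphased calls the two alleles of an individual cannot be read off directly when more than one SNP is heterozygous.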
In particular, the forensic parameters may include allele typing and its frequency, observed heterozygosity, expected heterozygosity, matching probability, polymorphism information content, individual identification probability, trio paternity exclusion probability, duo paternity exclusion probability, and effective number of alleles.
Specifically, the observed heterozygosity h = the number of heterozygotes in the sample / the total number of individuals in the sample;
expected heterozygosity, where n is the total number of all alleles in the sample, k is the number of allele or haplotype types, and p_i is the frequency of the i-th allele or haplotype in the sample;
matching probability, where n is the number of genotypes of a given genetic marker and p_i is the frequency of the i-th genotype in the population;
polymorphism information content, where n is the total number of all alleles in the sample and p_i is the frequency of the i-th allele;
individual identification probability, where n is the number of genotypes of a given genetic marker and p_i is the frequency of the i-th genotype in the population;
trio paternity exclusion probability, where n is the total number of all alleles in the sample and p_i and p_j are the frequencies of the i-th and j-th alleles, respectively;
duo paternity exclusion probability, where n is the total number of all alleles in the sample and p_i and p_j are the frequencies of the i-th and j-th alleles, respectively;
effective number of alleles, where n is the total number of all alleles in the sample and p_i is the frequency of the i-th allele.
And finally, the calculation result can be sorted, stored and output, and specifically, can be stored as a CSV format, so that the user can conveniently call the calculation result later.
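The parameters listed above can be computed with a short Python sketch. The expressions themselves are not legible in this text, so the sketch assumes commonly used textbook forms: the sample-size-corrected expected heterozygosity n/(n-1)·(1 - Σp_i²), matching probability as the sum of squared genotype frequencies, individual identification probability as one minus the matching probability, the usual pairwise form of the polymorphism information content, widely used power-sum forms of the two exclusion probabilities, and 1/Σp_i² for the effective number of alleles. The function name and the output keys are hypothetical, and these forms may differ from those intended in the application.

```python
from collections import Counter
from itertools import combinations

def forensic_parameters(genotypes):
    """genotypes: one (allele_a, allele_b) tuple of MH allele strings per individual."""
    n_ind = len(genotypes)
    allele_counts = Counter(a for geno in genotypes for a in geno)
    n = sum(allele_counts.values())                    # allele copies, 2 per individual
    p = {a: c / n for a, c in allele_counts.items()}   # allele frequencies
    geno_counts = Counter(tuple(sorted(geno)) for geno in genotypes)
    geno_freq = [c / n_ind for c in geno_counts.values()]
    # Power sums of the allele frequencies: sum(p^2), sum(p^3), sum(p^4), sum(p^5)
    s2, s3, s4, s5 = (sum(f ** k for f in p.values()) for k in (2, 3, 4, 5))
    mp = sum(f * f for f in geno_freq)
    return {
        "allele_freq": p,
        "Ho": sum(1 for a, b in genotypes if a != b) / n_ind,
        "He": n / (n - 1) * (1 - s2),                  # assumed unbiased estimator
        "MP": mp,
        "DP": 1 - mp,
        "PIC": 1 - s2 - sum(2 * (x * y) ** 2
                            for x, y in combinations(p.values(), 2)),
        # Assumed power-sum forms of the exclusion probabilities
        "PE_trio": (1 - 2 * s2 + s3 + 2 * s4 - 3 * s5
                    - 2 * s2 ** 2 + 3 * s2 * s3),
        "PE_duo": 1 - 4 * s2 + 2 * s2 ** 2 + 4 * s3 - 3 * s4,
        "Ae": 1 / s2,
    }
```

Running the function once per population CSV file and per target MH would yield the table of parameters that is then stored in CSV format.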
Referring to fig. 2, an operation flowchart of a screening method of micro haplotypes according to an embodiment of the present invention is shown.
In actual operation, the data to be screened are first obtained; they may comprise genome data and transcriptome data. For genome data, the position coordinates of the corresponding single nucleotide polymorphism markers are read from the genome data and a VCF file is generated; for transcriptome data, the data are first converted into a corresponding BED file, which is then converted into a VCF file. Coarse screening then yields the primary micro-haplotypes. Next, the fasta reference sequence corresponding to each primary micro-haplotype is searched for, and a sequence file corresponding to the fasta reference sequence is generated; the corresponding similar sequences are searched for, the GC content value and the short tandem repeats of the fasta reference sequence are calculated, and screening according to the set thresholds yields the target micro-haplotypes. Furthermore, typing data comprising a plurality of populations are obtained and split. Finally, the forensic parameters corresponding to the target micro-haplotypes are calculated using the split population typing data.
In this embodiment, the screening method for micro-haplotypes provided by the application has the following beneficial effects. The application reads the position coordinates of the single nucleotide polymorphism markers, performs coarse screening based on these coordinates to obtain the primary micro-haplotypes, searches for the reference sequence of each primary micro-haplotype, calculates the sequence characteristic parameters from the reference sequence, and finally screens out the target micro-haplotypes according to the sequence characteristic parameters, thereby achieving rapid screening of micro-haplotypes. The whole process is simple and fast: it shortens the screening time and improves both screening efficiency and screening accuracy. It covers the whole workflow of screening and evaluating MH from raw genome and transcriptome data, forming a complete technical solution that integrates and improves on existing approaches and greatly improves the practicability and flexibility of screening. The application also provides a unified naming scheme for genome-derived and transcriptome-derived MH loci and their corresponding alleles, which facilitates information exchange between different laboratories and rapid computer data processing.
The embodiment of the application also provides a screening device for the micro-haplotype, and referring to fig. 3, a schematic structural diagram of the screening device for the micro-haplotype is shown.
Wherein, as an example, the screening device of micro-haplotypes may comprise:
the reading module 301 is configured to obtain data to be screened, and read tag coordinates of a plurality of rows of single nucleotide polymorphism tags in the data to be screened;
a determining module 302, configured to determine N primary micro-haplotypes according to the marker coordinates of the plurality of rows of single nucleotide polymorphism markers, where N is a positive integer greater than or equal to 1;
the calculation module 303 is configured to search for a reference sequence corresponding to each of the preliminary micro-haplotypes, and calculate a sequence feature parameter corresponding to each of the preliminary micro-haplotypes using each of the reference sequences;
and a screening module 304, configured to screen M target micro-haplotypes from the N primary micro-haplotypes according to the sequence feature parameters, where M is a positive integer greater than or equal to 1, and N is greater than or equal to M.
Optionally, the determining module is further configured to:
dividing the marking coordinates of the multi-row single nucleotide polymorphism marks into N groups of marking coordinate sets according to a preset reference coordinate difference value;
storing the single nucleotide polymorphism markers contained in each set of marker coordinate sets into a preset python dictionary respectively;
And respectively extracting the single nucleotide polymorphism markers contained in each group of mark coordinate sets from a preset python dictionary according to the preset storage quantity, and setting the single nucleotide polymorphism markers contained in each group of mark coordinate sets as a primary micro-haplotype to obtain N primary micro-haplotypes.
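The grouping performed by the determining module can be sketched as follows. This is only one possible reading of the steps above: the "preset reference coordinate difference" is interpreted as a maximum span between the first and last SNP of a group, and the "preset storage quantity" as a minimum number of SNPs per group. The function name and both default values are hypothetical.

```python
def group_snps(positions, max_span=200, min_snps=2):
    """Greedily group sorted SNP coordinates into candidate micro-haplotypes.

    Returns a dictionary {group_number: [coordinates]} holding only the groups
    that contain at least min_snps SNPs within max_span bases.
    """
    groups = {}
    current = []
    for pos in sorted(positions):
        if current and pos - current[0] > max_span:
            # The window is closed: keep it only if it holds enough SNPs.
            if len(current) >= min_snps:
                groups[len(groups) + 1] = current
            current = []
        current.append(pos)
    if len(current) >= min_snps:
        groups[len(groups) + 1] = current
    return groups

print(group_snps([100, 150, 290, 1000, 1500, 1550]))
```

The greedy, non-overlapping windows keep the sketch short; a sliding window that allows overlapping candidates would be an equally plausible design.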
Optionally, the computing module is further configured to:
respectively preparing a sequence file according to the first single nucleotide polymorphism marker coordinate and the last single nucleotide polymorphism marker coordinate of each primary micro-haplotype;
and inputting the sequence file into a preset sequence searching tool, and searching to obtain a reference sequence corresponding to each primary micro-haplotype.
Optionally, the sequence feature parameters include GC content values, repetitive sequence features, and genome wide multiple matching indicators;
the computing module is further for:
searching a plurality of similar sequences from preset whole genome data by BLAST analysis by taking each reference sequence as a template, and calculating evaluation parameters of each similar sequence, wherein the evaluation parameters comprise expected values and scoring values;
counting the number of the similar sequences obtained by searching based on the expected value and the score value, and taking the number of the similar sequences as a whole genome multiple matching index;
Respectively calculating GC content values of each reference sequence;
and extracting the short tandem repeat sequence features from each reference sequence according to a preset repeat sequence feature value.
Optionally, the data to be screened comprises genomic data and transcriptome data;
the reading module is further configured to:
when the data to be screened is genome data, reading mark coordinates of a plurality of lines of single nucleotide polymorphism marks in the genome data;
when the data to be screened is transcriptome data, acquiring a start coordinate and a stop coordinate of a chromosome contained in the transcriptome data, taking a distance from the start coordinate to the stop coordinate as a coordinate interval, and screening mark coordinates of a plurality of target single nucleotide polymorphism marks with the coordinate values in the coordinate interval from the coordinate interval.
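The interval-based selection used for transcriptome data can be sketched as below. The sketch assumes that the transcript intervals are BED-style records (0-based start, exclusive end) and that SNP coordinates are 1-based; the function name and the tuple layouts are hypothetical.

```python
def snps_in_intervals(snps, intervals):
    """Keep the SNPs whose coordinate lies inside any interval on the same chromosome.

    snps:      list of (chrom, pos) with 1-based positions.
    intervals: list of (chrom, start, end) with 0-based start and exclusive end.
    """
    by_chrom = {}
    for chrom, start, end in intervals:
        by_chrom.setdefault(chrom, []).append((start, end))
    return [(chrom, pos) for chrom, pos in snps
            if any(start < pos <= end for start, end in by_chrom.get(chrom, []))]
```

For large inputs, sorting the intervals and using a binary search (or a tool such as bedtools) would avoid scanning every interval for every SNP.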
Optionally, the screening module is further configured to:
judging whether the GC content value corresponding to each primary micro-haplotype meets the preset content value condition, judging whether the repeated sequence characteristic corresponding to each primary micro-haplotype meets the preset target sequence characteristic condition, and judging whether the genome-wide multi-matching index meets the preset index condition;
And screening M preliminary micro-haplotypes of which the GC content values corresponding to the preliminary micro-haplotypes meet preset content value conditions, the repeated sequence characteristics corresponding to the preliminary micro-haplotypes meet preset target sequence characteristic conditions and the genome-wide multi-match indexes meet preset index conditions from the N preliminary micro-haplotypes to obtain M target micro-haplotypes.
Optionally, the apparatus further comprises:
the typing module is used for acquiring typing data, wherein the typing data comprise single nucleotide polymorphism marker typing data of a plurality of populations;
the splitting module is used for splitting the typing data into a plurality of population typing data sets according to preset 1000 Genomes Project population sources and sample names, wherein each population typing data set comprises the single nucleotide polymorphism marker typing data corresponding to each sample.
Optionally, the apparatus further comprises:
and the forensic parameter module is used for calculating corresponding forensic parameters using the target micro-haplotypes, wherein the forensic parameters comprise allele typing and its frequency, observed heterozygosity, expected heterozygosity, matching probability, polymorphism information content, individual identification probability, trio paternity exclusion probability, duo paternity exclusion probability, and effective number of alleles.
Optionally, the apparatus further comprises:
a naming module for naming each of the primary micro-haplotypes after the primary micro-haplotypes have been determined according to the marker coordinates of the plurality of single nucleotide polymorphism markers.
Further, an embodiment of the present application further provides an electronic device, including: memory, a processor and a computer program stored on the memory and executable on the processor, which when executed implements the method of screening micro-haplotypes as described in the above embodiments.
Further, an embodiment of the present application also provides a computer-readable storage medium storing computer-executable instructions for causing a computer to perform the method for screening micro-haplotypes according to the above embodiment.
While the foregoing is directed to the preferred embodiments of the present application, it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of the application, such changes and modifications are also intended to be within the scope of the application.

Claims (7)

1. A method for screening micro-haplotypes, comprising:
Acquiring data to be screened, and reading mark coordinates of a plurality of lines of single nucleotide polymorphism marks in the data to be screened;
determining N primary micro-haplotypes according to the marking coordinates of the multi-line single nucleotide polymorphism marks, wherein N is a positive integer greater than or equal to 1;
searching a reference sequence corresponding to each primary micro-haplotype, and calculating sequence characteristic parameters corresponding to each primary micro-haplotype by using each reference sequence;
screening the N primary selected micro-haplotypes according to the sequence characteristic parameters to obtain M target micro-haplotypes, wherein M is a positive integer greater than or equal to 1, and N is greater than or equal to M;
the sequence characteristic parameters comprise GC content values, repeated sequence characteristics and genome-wide multiple matching indexes;
the calculating the sequence characteristic value corresponding to each primary micro-haplotype by using each reference sequence comprises the following steps:
searching a plurality of similar sequences from preset whole genome data by BLAST analysis by taking each reference sequence as a template, and calculating evaluation parameters of each similar sequence, wherein the evaluation parameters comprise expected values and scoring values;
counting the number of the similar sequences obtained by searching based on the expected value and the score value, and taking the number of the similar sequences as a whole genome multiple matching index;
Respectively calculating GC content values of each reference sequence;
extracting short tandem repeat sequence features from each reference sequence according to a preset repeat sequence feature value;
the data to be screened comprises genome data and transcriptome data;
the reading of the mark coordinates of the multi-row single nucleotide polymorphism marks in the data to be screened comprises the following steps:
when the data to be screened is genome data, reading mark coordinates of a plurality of lines of single nucleotide polymorphism marks in the genome data;
when the data to be screened is transcriptome data, acquiring a start coordinate and a stop coordinate of a chromosome contained in the transcriptome data, and taking a distance from the start coordinate to the stop coordinate as a coordinate interval, and screening mark coordinates of a plurality of target single nucleotide polymorphism marks with the coordinate values in the coordinate interval from the coordinate interval;
the screening of M target micro-haplotypes from the N primary micro-haplotypes according to the sequence characteristic parameters comprises the following steps:
judging whether the GC content value corresponding to each primary micro-haplotype meets the preset content value condition, judging whether the repeated sequence characteristic corresponding to each primary micro-haplotype meets the preset target sequence characteristic condition, and judging whether the genome-wide multi-matching index meets the preset index condition;
And screening M preliminary micro-haplotypes of which the GC content values corresponding to the preliminary micro-haplotypes meet preset content value conditions, the repeated sequence characteristics corresponding to the preliminary micro-haplotypes meet preset target sequence characteristic conditions and the genome-wide multi-match indexes meet preset index conditions from the N preliminary micro-haplotypes to obtain M target micro-haplotypes.
2. The method according to claim 1, wherein determining N primary micro-haplotypes based on the marker coordinates of the plurality of rows of single nucleotide polymorphism markers comprises:
dividing the marking coordinates of the multi-row single nucleotide polymorphism marks into N groups of marking coordinate sets according to a preset reference coordinate difference value;
storing the single nucleotide polymorphism markers contained in each set of marker coordinate sets into a preset python dictionary respectively;
and respectively extracting the single nucleotide polymorphism markers contained in each group of mark coordinate sets from a preset python dictionary according to the preset storage quantity, and setting the single nucleotide polymorphism markers contained in each group of mark coordinate sets as a primary micro-haplotype to obtain N primary micro-haplotypes.
3. The method according to claim 1, wherein the searching for the reference sequence corresponding to each of the preliminary micro-haplotypes comprises:
respectively preparing a sequence file according to the first single nucleotide polymorphism marker coordinate and the last single nucleotide polymorphism marker coordinate of each primary micro-haplotype;
and inputting the sequence file into a preset sequence searching tool, and searching to obtain a reference sequence corresponding to each primary micro-haplotype.
4. The method of screening for micro-haplotypes according to claim 1, further comprising:
acquiring typing data, wherein the typing data comprise single nucleotide polymorphism marker typing data of a plurality of populations;
splitting the typing data into a plurality of population typing data sets according to preset 1000 Genomes Project population sources and sample names, wherein each population typing data set comprises the single nucleotide polymorphism marker typing data corresponding to each sample.
5. The method of screening for micro-haplotypes according to claim 1, further comprising:
calculating corresponding forensic parameters using the target micro-haplotypes, wherein the forensic parameters comprise allele typing and its frequency, observed heterozygosity, expected heterozygosity, matching probability, polymorphism information content, individual identification probability, trio paternity exclusion probability, duo paternity exclusion probability, and effective number of alleles.
6. The method of claim 1, wherein after the step of determining N primary micro-haplotypes based on the marker coordinates of the plurality of rows of single nucleotide polymorphism markers, the method further comprises:
naming each of the primary micro-haplotypes.
7. A screening apparatus for micro-haplotypes, said apparatus comprising:
the reading module is used for acquiring data to be screened and reading mark coordinates of a plurality of lines of single nucleotide polymorphism marks in the data to be screened;
the determining module is used for determining N primary micro-haplotypes according to the marking coordinates of the multi-line single nucleotide polymorphism marks, wherein N is a positive integer greater than or equal to 1;
the calculation module is used for searching the reference sequence corresponding to each primary micro-haplotype and calculating the sequence characteristic parameter corresponding to each primary micro-haplotype by using each reference sequence;
the screening module is used for screening M target micro-haplotypes from the N primary micro-haplotypes according to the sequence characteristic parameters, wherein M is a positive integer greater than or equal to 1, and N is greater than or equal to M;
the sequence characteristic parameters comprise GC content values, repeated sequence characteristics and genome-wide multiple matching indexes;
the calculating of the sequence characteristic parameter corresponding to each primary micro-haplotype by using each reference sequence comprises the following steps:
searching preset whole-genome data for a plurality of similar sequences by BLAST analysis, taking each reference sequence as the query template, and calculating evaluation parameters for each similar sequence, wherein the evaluation parameters comprise an expected value (E-value) and a score;
counting, based on the expected value and the score, the number of similar sequences found by the search, and taking that number as the genome-wide multiple matching index;
respectively calculating GC content values of each reference sequence;
extracting short tandem repeat sequence features from each reference sequence according to a preset repeat sequence feature value;
the data to be screened comprises genome data and transcriptome data;
the reading of the mark coordinates of the multi-row single nucleotide polymorphism marks in the data to be screened comprises the following steps:
when the data to be screened is genome data, reading mark coordinates of a plurality of lines of single nucleotide polymorphism marks in the genome data;
when the data to be screened is transcriptome data, acquiring a start coordinate and a stop coordinate of a chromosome contained in the transcriptome data, taking the span from the start coordinate to the stop coordinate as a coordinate interval, and selecting the marker coordinates of a plurality of target single nucleotide polymorphism markers whose coordinate values fall within the coordinate interval;
the screening of M target micro-haplotypes from the N primary micro-haplotypes according to the sequence characteristic parameters comprises the following steps:
judging whether the GC content value corresponding to each primary micro-haplotype meets a preset content value condition, judging whether the repeated sequence characteristic corresponding to each primary micro-haplotype meets a preset target sequence characteristic condition, and judging whether the genome-wide multiple matching index meets a preset index condition;
and selecting, from the N primary micro-haplotypes, the M primary micro-haplotypes whose GC content values meet the preset content value condition, whose repeated sequence characteristics meet the preset target sequence characteristic condition, and whose genome-wide multiple matching indexes meet the preset index condition, to obtain the M target micro-haplotypes.
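The sequence-feature screening recited in claim 7 can be sketched as three checks per candidate: GC content, presence of short tandem repeats, and the number of genome-wide matches. Everything concrete below is an assumption for illustration, because the claim only refers to "preset" conditions: the GC range of 0.3 to 0.7, the repeat definition (units of 2 to 6 bases repeated at least 4 times), the limit of one genome-wide hit, and the function names. The `genome_hits` count is taken as already computed from a separate BLAST run filtered by E-value and score.

```python
import re


def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_strs(seq, min_unit=2, max_unit=6, min_repeats=4):
    """Return (start, unit, repeat_count) for each tandem repeat found."""
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


def screen(candidates, gc_range=(0.3, 0.7), max_genome_hits=1):
    """Keep names of candidates that pass all three assumed conditions.

    candidates: list of dicts with keys 'name', 'sequence', 'genome_hits'.
    """
    kept = []
    for cand in candidates:
        gc_ok = gc_range[0] <= gc_content(cand["sequence"]) <= gc_range[1]
        str_ok = not find_strs(cand["sequence"])
        hits_ok = cand["genome_hits"] <= max_genome_hits
        if gc_ok and str_ok and hits_ok:
            kept.append(cand["name"])
    return kept


if __name__ == "__main__":
    # Hypothetical candidates; sequences and hit counts are made up.
    demo = [
        {"name": "mh1", "sequence": "ATGCGTACGTTAGCCTAGGA", "genome_hits": 1},
        {"name": "mh2", "sequence": "GGTTACACACACACGGCC", "genome_hits": 1},
        {"name": "mh3", "sequence": "ATGCGTACGTTAGCCTAGGA", "genome_hits": 5},
    ]
    print(screen(demo))
```

Passing the thresholds as parameters keeps the sketch neutral about the actual preset values, which the claims do not disclose.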
CN202110654476.7A 2021-06-11 2021-06-11 Screening method and device for micro haplotypes Active CN113284552B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110654476.7A CN113284552B (en) 2021-06-11 2021-06-11 Screening method and device for micro haplotypes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110654476.7A CN113284552B (en) 2021-06-11 2021-06-11 Screening method and device for micro haplotypes

Publications (2)

Publication Number Publication Date
CN113284552A (en) 2021-08-20
CN113284552B (en) 2023-10-03

Family

ID=77284418

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110654476.7A Active CN113284552B (en) 2021-06-11 2021-06-11 Screening method and device for micro haplotypes

Country Status (1)

Country Link
CN (1) CN113284552B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107862177A (en) * 2017-07-12 2018-03-30 中国水产科学研究院淡水渔业研究中心 Construction method for an SNP molecular marker set for distinguishing carp populations
CN109346130A (en) * 2018-10-24 2019-02-15 中国科学院水生生物研究所 Method for obtaining micro-haplotypes and their genotypes directly from whole-genome resequencing data
CN112233724A (en) * 2020-10-16 2021-01-15 深圳市盛景基因生物科技有限公司 Ancestral polymorphism prediction method based on big data artificial intelligence algorithm

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6909971B2 (en) * 2001-06-08 2005-06-21 Licentia Oy Method for gene mapping from chromosome and phenotype data
US20200168299A1 (en) * 2017-07-28 2020-05-28 Pioneer Hi-Bred International, Inc. Systems and methods for targeted genome editing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research progress of the genetic marker microhaplotype in forensic science; 陈鹏; 朱镜; 姜又菁; 陈丹; 王惠; 毛炯; 梁伟波; 张林; Chinese Journal of Forensic Medicine (05); pp. 54-57 *

Also Published As

Publication number Publication date
CN113284552A (en) 2021-08-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant