CN113284552B - Screening method and device for micro haplotypes - Google Patents

Info

Publication number
CN113284552B
Authority
CN
China
Prior art keywords
micro
data
haplotypes
sequence
single nucleotide
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110654476.7A
Other languages
Chinese (zh)
Other versions
CN113284552A (en
Inventor
乌日嘎
刘志勇
孙宏钰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202110654476.7A priority Critical patent/CN113284552B/en
Publication of CN113284552A publication Critical patent/CN113284552A/en
Application granted granted Critical
Publication of CN113284552B publication Critical patent/CN113284552B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20: Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G16B20/30: Detection of binding sites or motifs

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a microhaplotype screening method and device. The method comprises the following steps: obtaining data to be screened, and reading the marker coordinates of multiple rows of single nucleotide polymorphism markers in the data to be screened; determining N preliminary microhaplotypes according to the marker coordinates of the multiple rows of single nucleotide polymorphism markers; looking up the reference sequence corresponding to each preliminary microhaplotype, and calculating the sequence characteristic parameters of each preliminary microhaplotype from its reference sequence; and screening the N preliminary microhaplotypes according to the sequence characteristic parameters to obtain M target microhaplotypes. The invention can screen microhaplotypes from genomic or transcriptome data accurately and rapidly, which shortens screening time and improves screening efficiency.

Description

A microhaplotype screening method and device

Technical field

The present invention relates to the technical field of forensic genetics, and in particular to a microhaplotype screening method and device.

Background

In forensic genetics, commonly used genetic markers include short tandem repeats (STRs) and single nucleotide polymorphisms (SNPs). However, besides a high allelic mutation rate and unbalanced amplification, STR markers face a bottleneck that is hard to overcome: their sequence structure makes them prone to replication slippage during PCR, which produces stutter peaks that interfere with data analysis and makes the deconvolution of forensic mixed stains particularly difficult. A single SNP marker, by contrast, carries little genetic information, so a large number of markers must usually be typed to reach a discrimination power comparable to that of STRs.

A microhaplotype (MH) combines the advantages of STRs and SNPs without their drawbacks, which makes it a highly suitable forensic genetic marker. It is defined as a multi-allelic molecular marker formed by 2-5 SNPs within a 200 bp region of the genome, and was first proposed by Professor Kidd's laboratory at Yale University in the United States. Because MH research is still at an early stage, the field has no dedicated technical solution for finding MH loci; researchers typically search a data set manually, one entry at a time, or hire bioinformatics staff to help write simple ad hoc scripts for the search.

Manual searching is time-consuming, has low accuracy and easily misses loci, while searching with ad hoc scripts involves a heavy workload and high cost, which makes the technique difficult to popularize.

Summary of the invention

The present invention proposes a microhaplotype screening method and device that can screen MH loci from genetic data quickly, efficiently and accurately.

A first aspect of the embodiments of the present invention provides a microhaplotype screening method, the method including:

Obtain the data to be screened, and read the marker coordinates of multiple rows of single nucleotide polymorphism markers in the data to be screened;

Determine N preliminary microhaplotypes based on the marker coordinates of the multiple rows of single nucleotide polymorphism markers, where N is a positive integer greater than or equal to 1;

Look up the reference sequence corresponding to each preliminary microhaplotype, and use each reference sequence to calculate the sequence characteristic parameters of the corresponding preliminary microhaplotype;

Screen the N preliminary microhaplotypes according to the sequence characteristic parameters to obtain M target microhaplotypes, where M is a positive integer greater than or equal to 1 and N is greater than or equal to M.

In a possible implementation of the first aspect, determining N preliminary microhaplotypes based on the marker coordinates of the multiple rows of single nucleotide polymorphism markers includes:

Divide the marker coordinates of the multiple rows of single nucleotide polymorphism markers into N marker coordinate sets according to a preset reference coordinate difference;

Store the single nucleotide polymorphism markers contained in each marker coordinate set in a preset Python dictionary;

Extract, according to a preset storage quantity, the single nucleotide polymorphism markers contained in each marker coordinate set from the preset Python dictionary, and designate the markers contained in each marker coordinate set as one preliminary microhaplotype, obtaining N preliminary microhaplotypes.

In a possible implementation of the first aspect, looking up the reference sequence corresponding to each preliminary microhaplotype includes:

Create a sequence file based on the coordinates of the first and the last single nucleotide polymorphism marker of each preliminary microhaplotype;

Input the sequence file into a preset sequence search tool to retrieve the reference sequence corresponding to each preliminary microhaplotype.

In a possible implementation of the first aspect, the sequence characteristic parameters include a GC content value, repeat sequence features and a genome-wide multiple-match index;

Using each reference sequence to calculate the sequence characteristic values of the corresponding preliminary microhaplotype includes:

Using each reference sequence as a template, search preset whole-genome data for similar sequences by BLAST analysis, and calculate evaluation parameters for each similar sequence, the evaluation parameters including an expected value (E-value) and a score;

Count the similar sequences found, based on the expected value and the score, and use that count as the genome-wide multiple-match index;

Calculate the GC content value of each reference sequence;

Extract short tandem repeat features from each reference sequence according to preset repeat sequence feature values.
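The three sequence characteristic parameters described above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function names, the repeat-unit bounds (`min_unit`, `max_unit`, `min_copies`) and the BLAST thresholds (`max_evalue`, `min_bitscore`) are all assumptions, and the hit counter assumes BLAST+ tabular output (`-outfmt 6`), in which columns 11 and 12 hold the E-value and the bit score.

```python
import re


def gc_content(seq):
    """Fraction of G and C bases in a reference sequence (0.0 for empty input)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_tandem_repeats(seq, min_unit=2, max_unit=6, min_copies=4):
    """Return (start, unit, copies) for each short tandem repeat found.

    The unit-length bounds and minimum copy number are illustrative
    assumptions standing in for the "preset repeat sequence feature values".
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    return [
        (m.start(), m.group(1), len(m.group(0)) // len(m.group(1)))
        for m in pattern.finditer(seq.upper())
    ]


def count_blast_hits(tabular_lines, max_evalue=1e-5, min_bitscore=50.0):
    """Count hits in BLAST tabular output that pass both thresholds.

    The count plays the role of the genome-wide multiple-match index;
    both threshold values are hypothetical.
    """
    hits = 0
    for line in tabular_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or comment lines
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue <= max_evalue and bitscore >= min_bitscore:
            hits += 1
    return hits
```

A reference sequence that matches only its own genomic location would give a hit count of 1 under this sketch; larger counts suggest that the region is not unique in the genome.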

In a possible implementation of the first aspect, the data to be screened includes genomic data and transcriptome data;

Reading the marker coordinates of multiple rows of single nucleotide polymorphism markers in the data to be screened includes:

When the data to be screened is genomic data, read the marker coordinates of multiple rows of single nucleotide polymorphism markers in the genomic data;

When the data to be screened is transcriptome data, obtain the start and end coordinates of the chromosomes contained in the transcriptome data, take the span from the start coordinate to the end coordinate as a coordinate interval, and select the marker coordinates of the target single nucleotide polymorphism markers whose coordinate values fall within that interval.

In a possible implementation of the first aspect, screening the N preliminary microhaplotypes according to the sequence characteristic parameters to obtain M target microhaplotypes includes:

Determine, for each preliminary microhaplotype, whether its GC content value meets a preset content condition, whether its repeat sequence features meet a preset target sequence feature condition, and whether its genome-wide multiple-match index meets a preset index condition;

Select, from the N preliminary microhaplotypes, those whose GC content value meets the preset content condition, whose repeat sequence features meet the preset target sequence feature condition and whose genome-wide multiple-match index meets the preset index condition, obtaining M target microhaplotypes.
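The three-condition screening step can be illustrated with a short Python sketch. The text does not state the preset conditions, so the `SeqFeatures` record and every threshold below (GC range, maximum repeat count, maximum genome-wide hit count) are hypothetical placeholders chosen only to make the example runnable.

```python
from dataclasses import dataclass


@dataclass
class SeqFeatures:
    """Sequence characteristic parameters of one preliminary microhaplotype."""
    name: str
    gc: float            # GC content as a fraction between 0 and 1
    n_repeats: int       # number of short tandem repeats detected
    n_genome_hits: int   # genome-wide multiple-match index


def passes_screen(f, gc_range=(0.30, 0.70), max_repeats=0, max_hits=1):
    """True if all three conditions hold; all thresholds are assumed values."""
    return (
        gc_range[0] <= f.gc <= gc_range[1]
        and f.n_repeats <= max_repeats
        and f.n_genome_hits <= max_hits
    )


def screen(candidates, **thresholds):
    """Reduce N preliminary microhaplotypes to the M that pass every condition."""
    return [f for f in candidates if passes_screen(f, **thresholds)]
```

Because the result is a subset of the input, M can never exceed N, which matches the constraint stated in the method.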

In a possible implementation of the first aspect, the method further includes:

Obtain typing data, the typing data being single nucleotide polymorphism marker typing data covering a number of populations;

Split the typing data into multiple population typing data sets according to the preset 1000 Genomes Project population origins and sample names, where each population typing data set contains the single nucleotide polymorphism marker typing data of each of its samples.
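One way to picture the population split is a lookup from sample name to population label. The sketch below assumes the typing data has already been loaded into a dictionary keyed by sample name; the function name, the sample names and the genotype strings are invented for illustration, and only the population codes (CHB, GBR) are real 1000 Genomes labels.

```python
from collections import defaultdict


def split_by_population(genotypes, sample_to_pop):
    """Group per-sample typing data by population.

    genotypes:      {sample_name: typing data for that sample}
    sample_to_pop:  {sample_name: population code}
    Samples without a known population are skipped.
    """
    by_pop = defaultdict(dict)
    for sample, typing in genotypes.items():
        pop = sample_to_pop.get(sample)
        if pop is None:
            continue
        by_pop[pop][sample] = typing
    return dict(by_pop)


# Hypothetical sample names and genotypes, for illustration only.
genotypes = {"sample_A": "0|1", "sample_B": "1|1", "sample_C": "0|0"}
sample_to_pop = {"sample_A": "CHB", "sample_B": "GBR", "sample_C": "CHB"}
populations = split_by_population(genotypes, sample_to_pop)
```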

In a possible implementation of the first aspect, the method further includes:

Use the target microhaplotypes to calculate the corresponding forensic parameters, where the forensic parameters include allele types and their frequencies, observed heterozygosity, expected heterozygosity, match probability, polymorphism information content, power of discrimination, probability of exclusion for trios, probability of exclusion for duos, and effective number of alleles.
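Several of these forensic parameters follow directly from the allele frequencies of a locus. The sketch below uses the standard textbook formulas and assumes Hardy-Weinberg equilibrium for the match probability; the patent text does not say which formulas it uses, so this is an illustration under that assumption. With allele frequencies $p_i$: expected heterozygosity $H_e = 1 - \sum p_i^2$, effective number of alleles $A_e = 1 / \sum p_i^2$, polymorphism information content $PIC = 1 - \sum p_i^2 - \sum_{i<j} 2 p_i^2 p_j^2$, match probability $MP = \sum p_i^4 + \sum_{i<j} (2 p_i p_j)^2$, and power of discrimination $DP = 1 - MP$.

```python
from itertools import combinations


def forensic_parameters(freqs):
    """Compute He, Ae, PIC, MP and DP from a list of allele frequencies.

    Assumes the frequencies sum to 1 and that genotype frequencies follow
    Hardy-Weinberg proportions (an assumption, not stated in the source).
    """
    homozygosity = sum(p * p for p in freqs)
    he = 1.0 - homozygosity
    ae = 1.0 / homozygosity
    pic = he - sum(2.0 * p * p * q * q for p, q in combinations(freqs, 2))
    mp = sum(p ** 4 for p in freqs) + sum(
        (2.0 * p * q) ** 2 for p, q in combinations(freqs, 2)
    )
    return {"He": he, "Ae": ae, "PIC": pic, "MP": mp, "DP": 1.0 - mp}


# Two equally frequent microhaplotype alleles.
params = forensic_parameters([0.5, 0.5])
```

Observed heterozygosity and the exclusion probabilities need genotype counts or further formulas and are left out of this sketch.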

In a possible implementation of the first aspect, after the step of determining the preliminary microhaplotypes based on the coordinates of the multiple single nucleotide polymorphism markers, the method further includes:

Name each preliminary microhaplotype.

A second aspect of the embodiments of the present invention further provides a microhaplotype screening device, the device including:

A reading module, configured to obtain the data to be screened and read the marker coordinates of multiple rows of single nucleotide polymorphism markers in the data to be screened;

A determination module, configured to determine N preliminary microhaplotypes based on the marker coordinates of the multiple rows of single nucleotide polymorphism markers, where N is a positive integer greater than or equal to 1;

A calculation module, configured to look up the reference sequence corresponding to each preliminary microhaplotype and to use each reference sequence to calculate the sequence characteristic parameters of the corresponding preliminary microhaplotype;

A screening module, configured to screen the N preliminary microhaplotypes according to the sequence characteristic parameters to obtain M target microhaplotypes, where M is a positive integer greater than or equal to 1 and N is greater than or equal to M.

Compared with the prior art, the microhaplotype screening method and device provided by the embodiments of the present invention have the following beneficial effects. The invention reads the marker coordinates of single nucleotide polymorphism markers, performs a rough screening based on those coordinates to obtain preliminary microhaplotypes, then looks up the reference sequence of each preliminary microhaplotype, calculates sequence characteristic values from the reference sequence, and finally screens the target microhaplotypes according to those values, which achieves rapid microhaplotype screening. The whole process is simple and fast; it shortens screening time, improves screening efficiency and can also improve screening accuracy. In addition, the present application covers the whole process of screening and evaluating MH loci from raw genomic and transcriptome data, forming a complete technical solution that integrates and improves on existing approaches and greatly increases the practicality and flexibility of screening. The present application also provides a unified naming scheme for MH loci of genomic and transcriptomic origin and for their alleles, which facilitates information exchange between laboratories and rapid computer data processing.

Description of the drawings

Figure 1 is a schematic flow chart of a microhaplotype screening method provided by an embodiment of the present invention;

Figure 2 is an operation flow chart of a microhaplotype screening method provided by an embodiment of the present invention;

Figure 3 is a schematic structural diagram of a microhaplotype screening device provided by an embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

At present the field has no dedicated technical solution for this search; researchers typically search a data set manually, one entry at a time, or hire bioinformatics staff to help write simple ad hoc scripts. Manual searching is time-consuming, has low accuracy and easily misses loci. Non-specialists are limited by their grasp of forensic genetics, so the scripts they write are not only inaccurate but also costly, which makes the technique difficult to popularize.

To solve the above problems, a microhaplotype screening method provided by the embodiments of the present application is introduced and explained in detail through the following specific embodiments.

Referring to Figure 1, a schematic flow chart of a microhaplotype screening method provided by an embodiment of the present invention is shown.

As an example, the microhaplotype screening method may include:

S11. Obtain the data to be screened, and read the marker coordinates of multiple rows of single nucleotide polymorphism markers in the data to be screened.

The data to be screened is a human genome annotation file containing SNPs that the user prepares in advance; specifically, it may be a VCF-like file. The file may include: (1) CHROM: chromosome; (2) POS: genomic position; (3) ID: rsID of the variant site, or "." if none exists; (4) REF: reference allele; (5) ALT: variant allele; (6) QUAL: quality of the variant call at this site; (7) FILTER: filter status of the variant site, PASS if it passed and "." if no filtering was performed; (8) INFO: detailed information about the variant. For example, the user may download the SNP-VCF files of the 1000 Genomes Project data in advance.

To improve screening efficiency, in practice an empty worksheet file can be prepared in Python to hold the SNP-VCF data; Python modules such as pandas, os, xlwt, xlsxwriter, openpyxl and csv are then imported; a for loop then reads the SNP-VCF files of the different chromosomes one by one, and a nested for loop reads each chromosome's SNP-VCF file line by line, yielding the coordinates of multiple rows of single nucleotide polymorphism markers.

In this embodiment, the data to be screened may include genomic data and transcriptome data. As an example, step S11 may include the following sub-steps:

Sub-step S111: When the data to be screened is genomic data, read the marker coordinates of multiple rows of single nucleotide polymorphism markers in the genomic data.

In this embodiment, the marker coordinate is the difference between the coordinate of the single nucleotide polymorphism marker and the starting coordinate.

In a specific implementation, to simplify screening, the user can arrange in Python for each chromosome to correspond to one SNP-VCF file, read each chromosome's SNP-VCF file line by line with a for loop, and split each line with the split function to obtain information such as the chromosome number and the name and position of each single nucleotide polymorphism (SNP) marker, thereby obtaining multiple SNP markers.
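The line-by-line reading with `split` can be sketched as below. The function names are hypothetical, the input is assumed to be tab-separated with the standard VCF column order (CHROM, POS, ID first), and the records in `example_lines` are made-up sample data rather than real variants.

```python
def parse_vcf_line(line):
    """Split one VCF data line and return (chromosome, position, SNP name)."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], int(fields[1]), fields[2]


def read_snps(lines):
    """Collect SNP records from an iterable of VCF lines.

    Header lines (starting with '#') and blank lines are skipped.
    """
    snps = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        snps.append(parse_vcf_line(line))
    return snps


# Made-up example records; a real run would iterate over an open file,
# e.g. `with open(path) as handle: snps = read_snps(handle)`.
example_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t1000\trs_example1\tA\tG\t100\tPASS\t.\n",
    "1\t1050\trs_example2\tC\tT\t100\tPASS\t.\n",
]
snps = read_snps(example_lines)
```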

In this embodiment, the lookup and reading may proceed as follows: set the initial SNP position coordinate to "loc=0"; subtract "loc" from the SNP position coordinate obtained from the first row to get the marker coordinate of the first row; then subtract "loc" from the SNP position coordinate of the second row to get the marker coordinate of the second row; and so on, obtaining the marker coordinate of the single nucleotide polymorphism marker in every row.

Sub-step S112: When the data to be screened is transcriptome data, obtain the start and end coordinates of the chromosomes contained in the transcriptome data, and take the span from the start coordinate to the end coordinate as a coordinate interval.

Sub-step S113: Select the marker coordinates of the target single nucleotide polymorphism markers whose coordinate values fall within the coordinate interval.

In this embodiment, the transcriptome data is data organized in advance by the user. The transcriptome data may be a BED-like file containing six columns: (1) Chrom: chromosome number; (2) ChromStart: chromosome start coordinate; (3) ChromEnd: chromosome end coordinate; (4) Name: row name; (5) Score: 0-1000, the grey level displayed in the genome browser; (6) Strand: plus or minus strand.

In a specific implementation, the "ChromStart" and "ChromEnd" coordinates can be read to obtain the start and end marker coordinates.

In practice, transcriptome data and genomic data can be screened and read together: a for loop traverses the BED entries against the SNP-VCF file of sub-step S111 and, combined with an if condition, selects the cSNP genetic markers originating from the transcriptome data.

The lookup proceeds as follows: after the BED file is read line by line with a for loop, the genomic start and end coordinates ("ChromStart" and "ChromEnd") of a transcriptome genetic marker are obtained and used to define a coordinate interval, the "ChromStart-ChromEnd" interval.

Once the interval is determined, the coordinate calculation of sub-step S111 can be used to calculate the marker coordinate of the single nucleotide polymorphism marker in each row. The calculation is as described above and is not repeated here.

In this embodiment, it can be determined whether the coordinate of a genetic marker on the chromosome lies within the "ChromStart-ChromEnd" interval; if it does, the single nucleotide polymorphism (SNP) marker is identified as a cSNP.
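The interval test for cSNPs can be sketched as a nested loop with an if condition, mirroring the description. The function names are hypothetical. The closed-interval comparison (`start <= pos <= end`) is an assumption: standard BED coordinates are 0-based and half-open while VCF positions are 1-based, so a real pipeline may need an off-by-one adjustment.

```python
def parse_bed_line(line):
    """Return (chromosome, ChromStart, ChromEnd) from one BED-like line."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], int(fields[1]), int(fields[2])


def find_csnps(snps, intervals):
    """Keep the SNPs whose position lies inside any transcript interval.

    snps:      list of (chromosome, position, name) tuples
    intervals: list of (chromosome, start, end) tuples
    """
    csnps = []
    for chrom, pos, name in snps:
        for bed_chrom, start, end in intervals:
            if chrom == bed_chrom and start <= pos <= end:
                csnps.append((chrom, pos, name))
                break  # one matching interval is enough
    return csnps


# Made-up records for illustration.
bed_lines = ["1\t900\t1100\ttranscript_x\t0\t+\n"]
intervals = [parse_bed_line(line) for line in bed_lines]
snp_records = [("1", 1000, "snp_a"), ("1", 1375, "snp_b"), ("2", 1000, "snp_c")]
csnps = find_csnps(snp_records, intervals)
```

The nested loop is quadratic in the worst case; for whole-genome inputs, sorting both lists or using an interval index would be a natural refinement.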

To simplify subsequent operations, the preset SNP-VCF format file can be joined with the BED file of the transcriptome data based on the cSNPs found, and the result written out as a VCF-format result file, that is, a VCF-like file containing the transcriptome cSNPs.

S12. Determine N preliminary microhaplotypes based on the coordinates of the multiple rows of single nucleotide polymorphism markers, where N is a positive integer greater than or equal to 1.

Because the coordinates read span many rows, a rough screening can first be carried out on them to determine a certain number of preliminary microhaplotypes, from which the target microhaplotypes are then screened, thereby improving screening accuracy.

To improve screening efficiency, as an example, step S12 may include the following sub-steps:

Sub-step S121: Divide the marker coordinates of the multiple rows of single nucleotide polymorphism markers into N marker coordinate sets according to a preset reference coordinate difference.

In this embodiment, every 100 rows may be grouped into one marker coordinate set, or every 50, 200 or 500 rows; this can be adjusted according to actual needs.

Sub-step S122: Store the single nucleotide polymorphism markers contained in each marker coordinate set in a preset Python dictionary.

To make the computation convenient, in this embodiment a traversal can be performed to determine whether the single nucleotide polymorphism marker coordinate of each row belongs to a given marker coordinate set; markers that belong to the same set can first be stored in the preset Python dictionary.

For example, with a preset reference coordinate difference of 200, the coordinate of the single nucleotide polymorphism marker in row 2 is subtracted from that in row 1. The difference is 50, which is less than 200, so rows 1 and 2 are placed in the same group, and the single nucleotide polymorphism markers they contain are stored in one group of the preset Python dictionary. The coordinate in row 3 is then compared with that in row 1; the difference is 120, which is also less than 200, so the marker in row 3 is stored in the same group. The difference between the coordinates in row 4 and row 1 is 375, which is greater than the preset reference coordinate difference of 200, so rows 1 to 3 are determined to form one marker coordinate set.

Next, the coordinate of the single nucleotide polymorphism marker in row 4 is taken as the initial coordinate of the second marker coordinate set (microhaplotype) and the calculation restarts: the difference between the coordinates in row 4 and row 5 is calculated and compared with the preset reference coordinate difference, and so on.

需要说明的是,在实际操作中,预设的参考坐标差值可以为50、80、300、550或n等等,具体可以根据实际需要进行调整。It should be noted that in actual operation, the preset reference coordinate difference can be 50, 80, 300, 550 or n, etc., which can be adjusted according to actual needs.

在具体实现中,预设的python字典可以是由“键”和“值”构成,其中,“键”可以存储每一行单核苷酸多态性标记坐标所包含的微单倍型基因座的名称,“值”可以存储每个微单倍型基因座的属性,比如各个单核苷酸多态性标记的位置坐标的组合等。“键”和“值”是一一对应的关系。In a specific implementation, the preset python dictionary can be composed of "key" and "value", where the "key" can store the micro-haplotype loci contained in the single nucleotide polymorphism marker coordinates of each row. Name, "value" can store the attributes of each microhaplotype locus, such as the combination of position coordinates of each single nucleotide polymorphism marker, etc. "Key" and "value" have a one-to-one correspondence.
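The grouping rule described above can be sketched as follows. This is a minimal illustration, not the patent's own code: the function name `group_snps`, the group keys, and the sample coordinates are hypothetical, and treating a difference exactly equal to the threshold as "same group" is an assumption, since the text only covers the "less than" and "greater than" cases.

```python
def group_snps(positions, max_span=200):
    """Group ascending SNP coordinates into a dict of coordinate sets.

    A coordinate joins the current group while its distance from the
    group's FIRST coordinate is at most max_span (assumed inclusive);
    otherwise it opens a new group.
    """
    groups = {}
    current = []
    for pos in positions:
        if current and pos - current[0] > max_span:
            groups[f"group{len(groups) + 1}"] = current
            current = []
        current.append(pos)
    if current:
        groups[f"group{len(groups) + 1}"] = current
    return groups


# Differences from the first row are 50, 120 and 375, as in the example above.
positions = [1000, 1050, 1120, 1375, 1400]
print(group_snps(positions))
```

Rows 1 to 3 end up in the first group, and row 4 starts the second group.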

Sub-step S123: According to a preset storage number, extract from the preset Python dictionary the number of SNP markers contained in each set of marker coordinates, and designate the SNP markers contained in each set as one preliminary micro-haplotype, obtaining N preliminary micro-haplotypes.

In practice, the stored number is checked for each set of marker coordinates in turn. If the number of SNP markers stored in a set meets the minimum set by the user, those SNP markers are extracted from the set to give one preliminary micro-haplotype; this repeats until all sets have been checked.

For example, an if statement can check the number of SNP markers contained in the preset dictionary. If the number exceeds the preset minimum storage number (for example, 3), a micro-haplotype (MH) is deemed found, and that set of SNP marker coordinates is taken as a preliminary micro-haplotype.

In this embodiment, a for loop combined with an if condition, applying the criterion of at least 2 SNPs within a 200 bp range, screens micro-haplotypes (MH) efficiently.
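The for-loop and if-condition check can be sketched as below, assuming the groups are already stored in a dictionary. The name `select_candidates` and the sample data are hypothetical; the default of at least 2 SNPs follows the criterion stated for this embodiment.

```python
def select_candidates(groups, min_snps=2):
    """Keep only the groups that hold at least min_snps SNP coordinates."""
    selected = {}
    for name, coords in groups.items():
        if len(coords) >= min_snps:
            selected[name] = coords
    return selected


groups = {"group1": [1000, 1050, 1120], "group2": [1375], "group3": [2000, 2150]}
print(select_candidates(groups))
```

Raising `min_snps` tightens the screen; with `min_snps=3` only `group1` would remain.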

In practice, so that each micro-haplotype (MH) can later be distinguished and organized, as an example, after step S12 the method may further include:

Step S21: Name each preliminary micro-haplotype.

Specifically, the preliminary micro-haplotypes obtained above can be organized, and the chromosome number, the user-defined MH name, and the SNPs and position coordinates contained in the MH can be written to a preset worksheet file, thereby labelling each preliminary micro-haplotype.

During later management, the required MH can be looked up by name or number.

For naming, taking into account the conventions of forensic research and practice as well as the convenience of large-scale computer processing, this embodiment proposes a new naming scheme for MH loci.

For example, an MH of genomic origin may be named "mh21SHY5/15765902AC/15765915GAT/15766020AG/15766086CA". Here "mh" abbreviates "microhaplotype"; the number "21" after "mh" is the chromosome on which the locus lies (this field takes the values 01 to 22, X, Y, or MT); the capital letters "SHY" abbreviate the name of the laboratory that discovered the marker; and the following number "5" is the sequential number of the discovered locus, indicating that this marker is the fifth micro-haplotype the laboratory discovered on chromosome 21.

The fields after each "/" give the position coordinates, in ascending order, of the SNPs contained in the MH; each coordinate is followed by the SNP's reference allele (the first letter) and variant allele or alleles (the remaining letters). The name can be viewed as two parts separated by the first "/". The part before the first "/" is the short name of the marker: the marker can simply be called "mh21SHY5", which suits spoken or written communication in forensic practice. Together with the content after the "/", it forms the full name of the marker, which helps different laboratories compare such genetic markers and also suits large-scale computerized display and processing.
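Because the name is designed for machine processing, it can be split back into its fields. The parser below is an illustrative sketch: the function name, the returned dictionary keys, and the regular expressions are assumptions derived from the field descriptions above, and any field that does not look like "coordinate plus bases" is simply treated as a source label.

```python
import re

# Short name: "mh" + chromosome (two digits, X, Y or MT) + lab code + index.
SHORT_NAME = re.compile(r"mh(\d{2}|X|Y|MT)([A-Z]+)(\d+)")
# SNP field: coordinate, reference base, then one or more variant bases.
SNP_FIELD = re.compile(r"(\d+)([ACGT])([ACGT]+)")


def parse_mh_name(full_name):
    """Split a full MH name into chromosome, lab, index, source and SNPs."""
    short, *fields = full_name.split("/")
    match = SHORT_NAME.fullmatch(short)
    if match is None:
        raise ValueError(f"unrecognized short name: {short}")
    chrom, lab, index = match.groups()
    source, snps = None, []
    for field in fields:
        snp = SNP_FIELD.fullmatch(field)
        if snp is None:
            source = field  # non-SNP field, e.g. a transcript identifier
        else:
            snps.append((int(snp.group(1)), snp.group(2), list(snp.group(3))))
    return {"short_name": short, "chrom": chrom, "lab": lab,
            "index": int(index), "source": source, "snps": snps}


info = parse_mh_name("mh21SHY5/15765902AC/15765915GAT/15766020AG/15766086CA")
print(info["short_name"], info["snps"])
```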

This naming of MH is a balanced approach: it facilitates basic research on this type of marker and also supports large-scale forensic application once the research matures, including building large MH marker databases.

In addition, as with MH loci of genomic origin, MHs of transcriptomic origin also lack a standardized name, and no publicly recommended naming scheme currently exists in China or abroad. This application therefore also proposes a naming scheme consistent with that for genomic MHs, for example "mh21SHY1/H1_circ_007207/15024188CT/15024224GAC/15024284GC". Compared with the genomic scheme, it inserts "H1_circ_007207" between the sequential number and the first SNP position coordinate; this field identifies the transcriptomic source of the MH (a circRNA molecule in this example) and also distinguishes it from genomic MHs. All other fields have the same meaning as in the genomic scheme.

Likewise, the alleles of MH loci of genomic and transcriptomic origin also need names. Optionally, they can be named by their SNP calls. For example, for the genomic MH mh21SHY5/15765902AC/15765915GAT/15766020AG/15766086CA, an allele can be written directly in the ascending SNP coordinate order shown in the locus name, for instance "ATGC", meaning that the call at 15765902 is "A", the call at 15765915 is "T", and so on.
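As a small illustration of the allele naming rule, the sketch below joins per-SNP base calls in ascending coordinate order. The function name `allele_name` and the input layout (a dictionary from coordinate to base) are assumptions made for the example.

```python
def allele_name(calls):
    """Join the base called at each SNP, ordered by ascending coordinate."""
    return "".join(calls[pos] for pos in sorted(calls))


# Calls for one allele of mh21SHY5, deliberately given out of order.
calls = {15766086: "C", 15765902: "A", 15766020: "G", 15765915: "T"}
print(allele_name(calls))
```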

The MH locus naming scheme and the corresponding allele naming scheme proposed by the present invention, for both genomic and transcriptomic sources, have simple rules that are easy to learn; they facilitate communication between laboratories and, even more, batch processing by computer.

S13: Look up the reference sequence corresponding to each preliminary micro-haplotype, and use each reference sequence to calculate the sequence characteristic parameters of the corresponding preliminary micro-haplotype.

The reference sequence may be the MH sequence in FASTA format. The sequence characteristic parameters may be scores used in the subsequent calculation of forensic reference values for the MH.

In this embodiment, the bedtools software is used to look up the reference sequence of each preliminary micro-haplotype, with GRCh38_human_ref as the reference template sequence; the GRCh38_human_ref file is a standard reference file used for human genome alignment.

Specifically, to improve the accuracy of the lookup, as an example, step S13 may include the following sub-steps:

Sub-step S131: Create a sequence file from the coordinate of the first SNP marker and the coordinate of the last SNP marker of each preliminary micro-haplotype.

Specifically, the position coordinates of the SNP in the first row and of the SNP in the last row of the preliminary micro-haplotype can be extracted.

In this embodiment, the two coordinates can be joined to give a sequence file in BED format containing "CHROM, start position of cMH, end position of cMH, cMH_name".
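A BED record of this shape could be built as sketched below. The helper name `bed_line` is hypothetical. One assumption is labeled explicitly: BED intervals use a 0-based start and an exclusive end, so the sketch subtracts 1 from the first SNP's 1-based coordinate so that both boundary SNPs fall inside the interval; the text itself does not say whether such a shift is applied.

```python
def bed_line(chrom, snp_positions, name):
    """Build one tab-separated BED record spanning the first to last SNP.

    Assumes snp_positions are 1-based (VCF-style) coordinates.
    """
    start = min(snp_positions) - 1  # 0-based start (assumed conversion)
    end = max(snp_positions)        # exclusive end equals the 1-based last SNP
    return f"{chrom}\t{start}\t{end}\t{name}"


print(bed_line("chr21", [15765902, 15765915, 15766020, 15766086], "mh21SHY5"))
```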

Sub-step S132: Input the sequence file into a preset sequence lookup tool to obtain the reference sequence corresponding to each preliminary micro-haplotype.

The sequence file is input into bedtools, which retrieves the FASTA-format sequence of each MH from the whole human genome (GRCh38_human_ref).
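One way to perform this lookup is the `bedtools getfasta` subcommand. The sketch below only assembles the command line; the file names are hypothetical placeholders, and the exact invocation used by the inventors is not stated in the text. With `-name`, the BED name column is used in the FASTA headers, though the precise header format varies between bedtools versions.

```python
import shlex


def getfasta_command(reference_fasta, bed_path, out_fasta):
    """Assemble a bedtools getfasta call that extracts each BED interval."""
    return ["bedtools", "getfasta",
            "-fi", reference_fasta,  # indexed reference genome in FASTA format
            "-bed", bed_path,        # intervals built from first and last SNP
            "-name",                 # use the BED name column in the headers
            "-fo", out_fasta]        # output FASTA file


cmd = getfasta_command("GRCh38_human_ref.fa", "cMH.bed", "cMH.fa")
print(shlex.join(cmd))
# To run it: import subprocess; subprocess.run(cmd, check=True)
```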

The sequence of an MH (in FASTA format, for example) itself affects subsequent specific primer design and amplification, and hence later forensic application. To determine whether the retrieved MH sequence is unique in the genome, and to assess the accuracy of the similar sequences found, in this embodiment the sequence characteristic parameters include a GC content value, repeat sequence features, and a whole-genome multi-match index. As an example, step S13 may further include the following sub-steps:

Sub-step S134: Using each reference sequence as a template, find multiple similar sequences in preset whole-genome data by BLAST analysis, and calculate evaluation parameters for each similar sequence, the evaluation parameters including an expected value and a score.

Specifically, the blastn algorithm of the blastn software can be used to find similar sequences in the genome.

The blastn algorithm can search the whole human genome (GRCh38_human_ref) for multiple sequences similar to the FASTA-format sequence of the MH.

In practice, the E-value threshold for the FASTA-format sequence of the preliminary MH can be set to 0.00001 in blastn; similar sequences are then searched for, and several sequences similar to the reference sequence are selected from the results.

For example, for some reference sequences, 10 to 20 or more similar sequences can be found in the whole human genome.

During the search, the evaluation parameters of each similar sequence found can be calculated; these may include the expected value (E) and the score (S).

The score evaluates the similarity between a similar sequence and the reference sequence: the higher the score, the greater the similarity. Specifically, the blastn algorithm reports score information for each similar sequence, from which its score is obtained.

The expected value evaluates the reliability of the score: it reflects how likely it is that, by chance, other sequences in the database would be at least as similar to the reference sequence as this similar sequence is. The lower the value, the better. Specifically, the blastn algorithm reports the expected value for each similar sequence.

Sub-step S135: Based on the expected values and scores, count the similar sequences found, and take this count as the whole-genome multi-match index.

The whole-genome multi-match index is the number of similar sequences whose expected value meets the preset expected-value requirement and whose score meets the preset score requirement. There are many similar sequences, and their evaluation parameters vary; using every one would not only create a heavy workload but also reduce the precision and efficiency of screening.

To reduce the workload and improve screening precision, the number of similar sequences needed can be determined from the expected values and scores.

Specifically, the similar sequences can be ranked by expected value and score, and a certain number of them selected. For example, given 10 similar sequences, they are ranked from best to worst by expected value and score, the top 5 are taken as the target similar sequences, and these 5 are output and saved in XML format.

The more multi-match sequences across the whole genome that meet the specified score and expected value, the higher the likelihood of non-specific amplification in subsequent testing, and the more likely it is to interfere with analysis of the results.
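The multi-match index can be sketched as a count of BLAST hits passing both thresholds. The text mentions XML output; as a simplification, this sketch instead assumes blastn's tabular output (`-outfmt 6`), whose default 12 columns end with the E-value and the bit score. The function name, the bit-score cutoff of 50, and the three hit lines are hypothetical illustration values, not results.

```python
def count_matches(tabular_text, max_evalue=1e-5, min_bitscore=50.0):
    """Count hits per query that satisfy both the E-value and score cutoffs."""
    counts = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        query = fields[0]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue <= max_evalue and bitscore >= min_bitscore:
            counts[query] = counts.get(query, 0) + 1
    return counts


# Hypothetical hits: columns follow the default -outfmt 6 layout.
hits = "\n".join([
    "mh21SHY5\tchr21\t100.000\t185\t0\t0\t1\t185\t15765902\t15766086\t1e-90\t342",
    "mh21SHY5\tchr3\t92.000\t150\t12\t0\t10\t159\t500\t649\t2e-40\t180",
    "mh21SHY5\tchr7\t85.000\t40\t6\t0\t1\t40\t900\t939\t0.5\t30.1",
])
print(count_matches(hits))
```

A count of 1 would mean the sequence matches only its own locus; larger counts signal a higher risk of non-specific amplification.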

Sub-step S136: Calculate the GC content value of each reference sequence.

The GC content value is the proportion of guanine and cytosine among the four DNA bases. Specifically, the GC content value of the reference sequence of each MH can be calculated, because GC content strongly affects the PCR process.
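GC content can be computed directly from the FASTA sequence string. The sketch below is a minimal version; the function name is hypothetical, and ignoring non-ACGT symbols such as N in the denominator is an assumption.

```python
def gc_content(sequence):
    """Return the percentage of G and C among the A, C, G, T bases."""
    seq = sequence.upper()
    total = sum(seq.count(base) for base in "ACGT")
    if total == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / total


print(gc_content("ATGC"))
```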

Sub-step S137: Extract short tandem repeat features from each reference sequence according to preset repeat sequence feature values.

Specifically, the reference sequence can be checked for sequences resembling short tandem repeats (STRs).

For example, the motif length can be set to 1 to 6 bases and the repeat count to 4 or more; the STRs in the reference sequence are then extracted.
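A regular expression with a back-reference is one simple way to find such repeats. The sketch below assumes the stated settings (motif of 1 to 6 bases, at least 4 consecutive copies); the function name, the returned tuple layout, and the test sequence are hypothetical.

```python
import re

# A motif of 1-6 bases followed by at least 3 more copies of itself.
STR_PATTERN = re.compile(r"([ACGT]{1,6}?)\1{3,}")


def find_strs(sequence):
    """Return (motif, motif_length, repeat_count, start) for each STR-like run."""
    hits = []
    for match in STR_PATTERN.finditer(sequence.upper()):
        motif = match.group(1)
        repeats = len(match.group(0)) // len(motif)
        hits.append((motif, len(motif), repeats, match.start()))
    return hits


print(find_strs("CCAGATAGATAGATAGATGG"))
```

The lazy quantifier makes the search prefer the shortest motif, so a run such as AAAAAAAA is reported as motif "A" rather than "AA".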

Finally, the STR features and GC content values can be organized and output to a new file containing: mh_name, GC%, nuit_number, repeat-number.

S14: According to the sequence characteristic parameters, screen M target micro-haplotypes from the N preliminary micro-haplotypes, where M is a positive integer greater than or equal to 1 and N is greater than or equal to M.

After the sequence feature values and GC content value of each preliminary MH are obtained, each MH can be judged on them: if the sequence feature values of an MH meet the preset feature threshold, or its GC content value meets the preset content threshold, the preliminary micro-haplotype is determined to be a target micro-haplotype.

Since the sequence characteristic parameters include the GC content value, the repeat sequence features, and the whole-genome multi-match index, step S14 may include the following sub-steps:

Sub-step S141: For each preliminary micro-haplotype, determine whether its GC content value meets a preset content condition, whether its repeat sequence features meet a preset target sequence feature condition, and whether its whole-genome multi-match index meets a preset index condition.

Sub-step S142: From the N preliminary micro-haplotypes, select those whose GC content value meets the preset content condition, whose repeat sequence features meet the preset target sequence feature condition, and whose whole-genome multi-match index meets the preset index condition, obtaining M target micro-haplotypes.

Specifically, the GC content value, repeat sequence features, and whole-genome multi-match index of each preliminary micro-haplotype are compared with the corresponding preset content value, preset comparison sequence features, and preset count. When all three meet their preset requirements, the preliminary micro-haplotype is determined to be a target micro-haplotype.
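The three-way check of sub-steps S141 and S142 can be sketched as a single predicate. Everything concrete here is an assumption for illustration: the text gives no threshold values, so the GC range of 30 to 70 percent, the limit of zero STR hits, the limit of one genome match, and the candidate records are all hypothetical.

```python
def passes_filters(features, gc_range=(30.0, 70.0),
                   max_str_hits=0, max_genome_matches=1):
    """Return True only when all three sequence conditions are met."""
    low, high = gc_range
    return (low <= features["gc_percent"] <= high
            and features["str_hits"] <= max_str_hits
            and features["genome_matches"] <= max_genome_matches)


# Hypothetical feature records for three preliminary micro-haplotypes.
candidates = {
    "cMH_1": {"gc_percent": 48.5, "str_hits": 0, "genome_matches": 1},
    "cMH_2": {"gc_percent": 81.0, "str_hits": 0, "genome_matches": 1},
    "cMH_3": {"gc_percent": 52.0, "str_hits": 2, "genome_matches": 7},
}
targets = [name for name, feats in candidates.items() if passes_filters(feats)]
print(targets)
```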

To facilitate the subsequent calculation of forensic parameters, as an example, the method may further include:

S15: Obtain typing data, the typing data being SNP marker typing data covering a number of populations.

S16: Split the typing data into multiple population typing datasets according to the preset 1000 Genomes population sources and sample names, each population typing dataset including the SNP marker typing data of each sample.

Specifically, the typing data is data collected or downloaded in advance by the user, covering a certain number of individuals and their human genome data.

The typing data is split into several population typing datasets according to the preset 1000 Genomes population sources and sample names.

For example, if the typing data comprises 2,504 individuals from 26 population sources, it can be split by population source into 26 population typing datasets.

In a specific implementation, the user can prepare a TXT file containing the sample names and corresponding population source information, then open the individual raw typing data of the 1000 Genomes Project chromosome by chromosome, match entries one by one with a for loop, associate each sample with its population source as recorded in the TXT file, and finally split the typing data into 26 population typing datasets.
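The sample-to-population matching can be sketched as a loop over the lines of the TXT file. The layout assumed here (one sample name and one population code per line, separated by whitespace) and the sample names are hypothetical; the actual file format is not specified in the text.

```python
def split_by_population(panel_lines):
    """Map each population code to the list of its sample names."""
    populations = {}
    for line in panel_lines:
        if not line.strip():
            continue
        sample, population = line.split()[:2]
        populations.setdefault(population, []).append(sample)
    return populations


# Hypothetical panel content: "<sample name> <population code>".
panel = ["S001 CHB", "S002 GBR", "S003 CHB", "S004 YRI"]
print(split_by_population(panel))
```

The resulting sample lists can then be used to select the matching columns of the typing data for each population.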

In addition, since each population typing dataset includes the SNP marker typing data of each sample, the population typing data can be split into CSV files so that it can conveniently be combined with the target micro-haplotypes obtained from the genome or transcriptome in subsequent calculations.

To calculate common forensic parameters for each population of each genome, as an example, the method may further include:

S17: Using the target micro-haplotypes, calculate the corresponding forensic parameters, which include the allele types and their frequencies, observed heterozygosity, expected heterozygosity, match probability, polymorphism information content, power of discrimination, trio probability of paternity exclusion, duo probability of paternity exclusion, and effective number of alleles.

Specifically, the SNP typing data of each population can be combined according to the SNPs contained in each target MH obtained above. If CSV files were obtained by splitting the 1000 Genomes data, all SNP genotypes of each individual in a CSV file can be combined into the genotype of the target MH (each MH consists of several adjacent SNPs), giving the MH genotype of each individual. To simplify later calculation, the SNPs contained in the MH are converted to the standard "ATCG" base format (the raw SNP calls are coded numerically and are converted to bases for readability).
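Combining numeric SNP calls into base-format MH alleles can be sketched as follows. The sketch assumes phased, VCF-style calls such as "0|1", where 0 denotes the reference base and 1, 2, ... denote the variant bases in order; the function name and the sample calls are hypothetical.

```python
def mh_genotype(snps, sample_calls):
    """Build the two MH alleles of one individual from phased SNP calls.

    snps: list of (position, ref_base, [alt_bases]).
    sample_calls: dict mapping position to a phased call such as "0|1".
    """
    hap1, hap2 = [], []
    for pos, ref, alts in sorted(snps, key=lambda snp: snp[0]):
        bases = [ref] + list(alts)      # index 0 = reference, 1.. = variants
        first, second = sample_calls[pos].split("|")
        hap1.append(bases[int(first)])
        hap2.append(bases[int(second)])
    return "".join(hap1), "".join(hap2)


snps = [(15765902, "A", ["C"]), (15765915, "G", ["A", "T"]),
        (15766020, "A", ["G"]), (15766086, "C", ["A"])]
calls = {15765902: "0|1", 15765915: "2|0", 15766020: "1|0", 15766086: "0|0"}
print(mh_genotype(snps, calls))
```

For this individual the two MH alleles are "ATGC" and "CGAC", written in the allele naming convention described earlier.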

Specifically, the forensic parameters may include the allele types and their frequencies, observed heterozygosity, expected heterozygosity, match probability, polymorphism information content, power of discrimination, trio probability of paternity exclusion, duo probability of paternity exclusion, and effective number of alleles.

Specifically, observed heterozygosity h = number of heterozygotes in the sample / total number of individuals in the sample;

Expected heterozygosity, where n is the total number of alleles in the sample, k is the number of distinct alleles or haplotypes, and p_i is the frequency of the i-th allele or haplotype in the sample;

Match probability, where n is the number of genotypes of a given genetic marker and p_i is the frequency of the i-th genotype in the population;

Polymorphism information content, where n is the total number of alleles in the sample and p_i is the frequency of the i-th allele;

Power of discrimination, where n is the number of genotypes of a given genetic marker and p_i is the frequency of the i-th genotype in the population;

Trio probability of paternity exclusion, where n is the total number of alleles in the sample, and p_i and p_j are the frequencies of the i-th and j-th alleles respectively;

Duo probability of paternity exclusion, where n is the total number of alleles in the sample, and p_i and p_j are the frequencies of the i-th and j-th alleles respectively;

Effective number of alleles, where n is the total number of alleles in the sample and p_i is the frequency of the i-th allele.
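The formulas themselves do not survive in this text, so the sketch below uses commonly cited population-genetics forms that agree with the variable definitions given above. These forms are assumptions and may differ in detail from the patent's own formulas; the function name, dictionary keys, and example genotypes are hypothetical.

```python
from collections import Counter
from itertools import combinations


def forensic_parameters(genotypes):
    """Compute common forensic parameters from (allele_1, allele_2) tuples."""
    individuals = len(genotypes)
    alleles = [allele for pair in genotypes for allele in pair]
    n = len(alleles)  # total number of alleles in the sample
    p = [count / n for count in Counter(alleles).values()]
    genotype_counts = Counter(tuple(sorted(pair)) for pair in genotypes)
    g = [count / individuals for count in genotype_counts.values()]
    sum_p2 = sum(x * x for x in p)
    pm = sum(x * x for x in g)          # match probability
    pairs = list(combinations(p, 2))    # unordered allele pairs, i < j
    return {
        "Ho": sum(a != b for a, b in genotypes) / individuals,
        "He": n / (n - 1) * (1 - sum_p2),
        "PM": pm,
        "DP": 1 - pm,
        "PIC": 1 - sum_p2 - sum(2 * pi**2 * pj**2 for pi, pj in pairs),
        "PE_trio": sum(x * (1 - x) ** 2 for x in p)
                   - sum(pi**2 * pj**2 * (4 - 3 * pi - 3 * pj) for pi, pj in pairs),
        "PE_duo": sum(x**2 * (1 - x) ** 2 for x in p)
                  + sum(2 * pi * pj * (1 - pi - pj) ** 2 for pi, pj in pairs),
        "Ae": 1 / sum_p2,
    }


# Four hypothetical individuals typed at one MH with two alleles.
genotypes = [("ATGC", "CGAC"), ("ATGC", "ATGC"), ("CGAC", "CGAC"), ("ATGC", "CGAC")]
for key, value in forensic_parameters(genotypes).items():
    print(key, round(value, 4))
```

Allele-based quantities (He, PIC, the exclusion probabilities, Ae) use allele frequencies, while PM and DP use genotype frequencies, mirroring the two kinds of p_i defined in the text.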

Finally, the calculation results can be organized, saved, and output; specifically, they can be saved in CSV format for later use.

Referring to FIG. 2, an operational flowchart of a micro-haplotype screening method provided by an embodiment of the present invention is shown.

In practice, the data to be screened is first obtained; it may include genomic data and transcriptomic data. For genomic data, the position coordinates of the SNP markers are read from the data and a VCF file is generated. For transcriptomic data, the data is converted to the corresponding BED file, which is then converted to a VCF file. Preliminary micro-haplotypes are then obtained by coarse screening. Next, the FASTA reference sequence of each preliminary micro-haplotype is looked up and the corresponding sequence file generated. Similar sequences are then searched for, the GC content value and short tandem repeats of each FASTA reference sequence are calculated, and the target micro-haplotypes are selected according to the set thresholds. Typing data covering multiple populations is then obtained and split. Finally, the split population typing data is used to calculate the forensic parameters of the target micro-haplotypes.

In this embodiment, a micro-haplotype screening method is provided, with the following beneficial effects. The invention reads the position coordinates of SNP markers, performs a coarse screen on those coordinates to obtain preliminary micro-haplotypes, looks up the reference sequence of each preliminary micro-haplotype, calculates sequence feature values from the reference sequence, and finally selects the target micro-haplotypes according to those values, achieving rapid screening of micro-haplotypes. The whole process is simple and fast: it shortens screening time, improves screening efficiency, and can also improve screening accuracy. The application covers the entire process of screening and evaluating MHs from raw genomic and transcriptomic data, forming a complete technical solution that integrates and improves on existing approaches and greatly increases the practicality and flexibility of screening. The application also provides a unified naming scheme for MH loci of genomic and transcriptomic origin and for their alleles, which facilitates information exchange between laboratories and rapid data processing by computer.

An embodiment of the present invention also provides a microhaplotype screening apparatus. FIG. 3 shows a schematic structural diagram of a microhaplotype screening apparatus provided by an embodiment of the present invention.

As an example, the microhaplotype screening apparatus may include:

a reading module 301, configured to obtain data to be screened and read the marker coordinates of multiple rows of SNP markers in the data to be screened;

a determination module 302, configured to determine N preliminary microhaplotypes from the marker coordinates of the multiple rows of SNP markers, where N is a positive integer greater than or equal to 1;

a calculation module 303, configured to look up the reference sequence corresponding to each preliminary microhaplotype and to use each reference sequence to calculate the sequence feature parameters of the corresponding preliminary microhaplotype;

a screening module 304, configured to screen M target microhaplotypes from the N preliminary microhaplotypes according to the sequence feature parameters, where M is a positive integer greater than or equal to 1 and N is greater than or equal to M.

Optionally, the determination module is further configured to:

divide the marker coordinates of the multiple rows of SNP markers into N marker coordinate sets according to a preset reference coordinate difference;

store the SNP markers contained in each marker coordinate set in a preset Python dictionary;

extract from the preset Python dictionary, according to a preset storage quantity, the SNP markers contained in each marker coordinate set, and define the SNP markers of each marker coordinate set as one preliminary microhaplotype, thereby obtaining N preliminary microhaplotypes.
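The grouping step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the application: the function name `group_snps`, the `MH<n>` dictionary keys, and the parameters `max_span` (standing in for the preset reference coordinate difference) and `min_snps` (standing in for the preset storage quantity) are all hypothetical.

```python
def group_snps(coords, max_span=200, min_snps=2):
    """Group SNP positions into candidate microhaplotypes.

    coords   : iterable of (chromosome, position) tuples
    max_span : assumed meaning of the "preset reference coordinate
               difference": maximum distance between the first SNP of a
               group and any later SNP in the same group
    min_snps : assumed meaning of the "preset storage quantity":
               minimum number of SNPs a group needs to be kept
    Returns a dict mapping a hypothetical key "MH<n>" to its SNP list.
    """
    groups = {}
    current = []
    index = 0
    for chrom, pos in sorted(coords):
        # Close the current group when the chromosome changes or the
        # new SNP lies too far from the first SNP of the group.
        if current and (chrom != current[0][0] or pos - current[0][1] > max_span):
            if len(current) >= min_snps:
                index += 1
                groups[f"MH{index}"] = current
            current = []
        current.append((chrom, pos))
    # Flush the last open group.
    if len(current) >= min_snps:
        index += 1
        groups[f"MH{index}"] = current
    return groups


example = [("chr1", 100), ("chr1", 150), ("chr1", 260),
           ("chr1", 1000), ("chr2", 50), ("chr2", 120)]
candidates = group_snps(example, max_span=200, min_snps=2)
```

In this sketch the isolated SNP at `chr1:1000` is dropped because its group never reaches `min_snps`; other windowing rules (for example, sliding windows that allow overlapping groups) would be equally compatible with the text.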

Optionally, the calculation module is further configured to:

create a sequence file from the coordinates of the first SNP marker and the last SNP marker of each preliminary microhaplotype;

input the sequence file into a preset sequence lookup tool to obtain the reference sequence corresponding to each preliminary microhaplotype.
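As a rough illustration of the sequence file step, the sketch below writes one interval per preliminary microhaplotype in BED format (0-based start, end-exclusive). The text does not name the file format or the lookup tool, so BED is an assumption here, as are the function name `bed_lines` and the input layout; a tool such as `bedtools getfasta` could consume a file of this kind.

```python
def bed_lines(groups):
    """Build BED lines spanning the first to the last SNP of each group.

    groups : dict mapping a microhaplotype name to a list of
             (chromosome, 1-based position) tuples on one chromosome
    Returns a list of tab-separated BED lines (chrom, start, end, name).
    """
    lines = []
    for name, snps in groups.items():
        chrom = snps[0][0]
        positions = [pos for _, pos in snps]
        start = min(positions) - 1   # BED start is 0-based
        end = max(positions)         # BED end is exclusive, so no -1
        lines.append(f"{chrom}\t{start}\t{end}\t{name}")
    return lines


regions = bed_lines({"MH1": [("chr1", 100), ("chr1", 150), ("chr1", 260)]})
```

Writing `"\n".join(regions)` to a file gives an input that covers exactly the bases from the first SNP to the last SNP, inclusive.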

Optionally, the sequence feature parameters include a GC content value, repeat sequence features, and a genome-wide multiple-match index;

the calculation module is further configured to:

using each reference sequence as a template, search preset whole-genome data for multiple similar sequences by BLAST analysis, and calculate evaluation parameters for each similar sequence, the evaluation parameters including an expected value (E-value) and a score;

count, based on the expected value and the score, the number of similar sequences found, and take that number as the genome-wide multiple-match index;

calculate the GC content value of each reference sequence;

extract short tandem repeat (STR) features from each reference sequence according to a preset repeat sequence feature value.
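The three sequence feature parameters can be sketched as below. This is a hedged example only: the function names, the thresholds `max_evalue` and `min_bitscore`, and the repeat settings (`min_unit`, `max_unit`, `min_repeats`) are illustrative assumptions, since the text leaves the preset values unspecified. The BLAST helper assumes the standard 12-column tabular output (`-outfmt 6`), in which the E-value and bit score are the 11th and 12th columns.

```python
import re


def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty one)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_strs(seq, min_unit=2, max_unit=6, min_repeats=3):
    """Find tandem repeats as (unit, repeat_count, 0-based start).

    A repeat is a unit of min_unit..max_unit bases that occurs at least
    min_repeats times in a row. Note that a homopolymer run such as
    AAAAAA is reported as unit "AA" with this simple pattern.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        hits.append((unit, len(match.group(0)) // len(unit), match.start()))
    return hits


def count_blast_hits(tabular_text, max_evalue=1e-5, min_bitscore=50.0):
    """Count hits per query in BLAST tabular (-outfmt 6) output.

    Only hits passing both the E-value and bit-score thresholds are
    counted; the count plays the role of the multiple-match index.
    """
    counts = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        query = fields[0]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue <= max_evalue and bitscore >= min_bitscore:
            counts[query] = counts.get(query, 0) + 1
    return counts
```

A reference sequence that maps to a single genomic location would typically produce a count of 1 (the hit to itself), so a larger count suggests the region is duplicated elsewhere in the genome.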

Optionally, the data to be screened includes genomic data and transcriptome data;

the reading module is further configured to:

when the data to be screened is genomic data, read the marker coordinates of multiple rows of SNP markers in the genomic data;

when the data to be screened is transcriptome data, obtain the start coordinate and the end coordinate of the chromosome contained in the transcriptome data, take the span from the start coordinate to the end coordinate as a coordinate interval, and select the marker coordinates of multiple target SNP markers whose coordinate values fall within that interval.
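For the transcriptome branch, the interval filter might look like the following sketch. The function name `snps_in_interval` is hypothetical, and treating both interval ends as inclusive is an assumption; the text does not say how the boundaries are handled.

```python
def snps_in_interval(snps, chrom, start, end):
    """Keep SNPs on `chrom` whose position lies in [start, end].

    snps  : iterable of (chromosome, position) tuples
    start : start coordinate taken from the transcriptome data
    end   : end coordinate taken from the transcriptome data
    Both bounds are treated as inclusive (an assumption).
    """
    return [(c, p) for c, p in snps if c == chrom and start <= p <= end]


all_snps = [("chr1", 90), ("chr1", 100), ("chr1", 250), ("chr1", 400), ("chr2", 150)]
targets = snps_in_interval(all_snps, "chr1", 100, 300)
```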

Optionally, the screening module is further configured to:

determine, for each preliminary microhaplotype, whether its GC content value satisfies a preset content value condition, whether its repeat sequence features satisfy a preset target sequence feature condition, and whether its genome-wide multiple-match index satisfies a preset index condition;

select from the N preliminary microhaplotypes those whose GC content value satisfies the preset content value condition, whose repeat sequence features satisfy the preset target sequence feature condition, and whose genome-wide multiple-match index satisfies the preset index condition, thereby obtaining M target microhaplotypes.
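The three-condition filter can be sketched as below. All thresholds are placeholders chosen for illustration (the text only says they are preset), and the dictionary keys `gc`, `str_count`, and `blast_hits` are hypothetical names for the three sequence feature parameters.

```python
def screen_candidates(features, gc_range=(0.3, 0.7), max_strs=0, max_hits=1):
    """Return names of candidates that pass all three conditions.

    features : dict mapping a candidate name to a dict with keys
               "gc" (GC fraction), "str_count" (number of STRs found)
               and "blast_hits" (genome-wide multiple-match index)
    gc_range, max_strs, max_hits : illustrative preset thresholds
    """
    low, high = gc_range
    return [
        name
        for name, f in features.items()
        if low <= f["gc"] <= high
        and f["str_count"] <= max_strs
        and f["blast_hits"] <= max_hits
    ]


feature_table = {
    "MH1": {"gc": 0.48, "str_count": 0, "blast_hits": 1},
    "MH2": {"gc": 0.82, "str_count": 0, "blast_hits": 1},  # GC too high
    "MH3": {"gc": 0.51, "str_count": 2, "blast_hits": 1},  # contains STRs
    "MH4": {"gc": 0.45, "str_count": 0, "blast_hits": 4},  # multi-mapped
}
selected = screen_candidates(feature_table)
```

Because a candidate must pass every condition, the number selected (M) can never exceed the number of candidates (N), which matches the constraint stated for the screening module.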

Optionally, the apparatus further includes:

a typing module, configured to obtain typing data, the typing data being SNP marker genotyping data covering a number of populations;

a splitting module, configured to split the typing data into multiple population typing data sets according to preset 1000 Genomes Project population origin and sample name, where each population typing data set includes the SNP marker genotyping data of each of its samples.
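A minimal sketch of the splitting step is shown below, assuming the genotyping data is already held in memory keyed by sample name and that a sample-to-population table (such as a panel file) has been parsed into a dictionary. The function name and the sample and population labels are made up for illustration and are not real 1000 Genomes identifiers.

```python
from collections import defaultdict


def split_by_population(genotypes, sample_to_pop):
    """Split per-sample genotype records into per-population groups.

    genotypes     : dict mapping sample name -> genotype calls
    sample_to_pop : dict mapping sample name -> population label
    Samples without a population label are skipped.
    """
    by_pop = defaultdict(dict)
    for sample, calls in genotypes.items():
        pop = sample_to_pop.get(sample)
        if pop is None:
            continue
        by_pop[pop][sample] = calls
    return dict(by_pop)


# Hypothetical sample names and population labels.
calls = {"S1": ["0|1", "1|1"], "S2": ["0|0", "0|1"], "S3": ["1|1", "1|0"], "S4": ["0|0", "0|0"]}
panel = {"S1": "POP_A", "S2": "POP_B", "S3": "POP_A"}
split = split_by_population(calls, panel)
```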

Optionally, the apparatus further includes:

a forensic parameter module, configured to calculate forensic parameters for the target microhaplotypes, where the forensic parameters include allele types and their frequencies, observed heterozygosity, expected heterozygosity, matching probability, polymorphism information content, individual identification probability (power of discrimination), trio non-paternity exclusion probability, duo non-paternity exclusion probability, and effective number of alleles.
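Several of these parameters follow directly from allele and genotype frequencies. The sketch below uses the common textbook definitions, which are assumed here because the text does not give formulas: expected heterozygosity 1 - Σp², effective number of alleles 1/Σp², polymorphism information content 1 - Σp² - ΣΣ 2p_i²p_j² over pairs i < j, matching probability as the sum of squared observed genotype frequencies, and power of discrimination as 1 minus the matching probability. The exclusion probabilities are omitted, no small-sample correction is applied, and the function name and input layout are hypothetical.

```python
from collections import Counter


def forensic_parameters(genotypes):
    """Compute basic forensic statistics for one microhaplotype locus.

    genotypes : list of (allele_1, allele_2) tuples, one per individual,
                where each allele is a haplotype string such as "ACG"
    """
    n = len(genotypes)
    if n == 0:
        raise ValueError("at least one genotype is required")

    # Allele frequencies from the 2n observed chromosomes.
    allele_counts = Counter(a for pair in genotypes for a in pair)
    freqs = {a: c / (2 * n) for a, c in allele_counts.items()}
    p = list(freqs.values())
    sum_sq = sum(x * x for x in p)

    h_exp = 1.0 - sum_sq
    h_obs = sum(1 for a, b in genotypes if a != b) / n
    pic = h_exp - sum(
        2 * p[i] ** 2 * p[j] ** 2
        for i in range(len(p))
        for j in range(i + 1, len(p))
    )

    # Genotype frequencies ignore allele order within an individual.
    geno_counts = Counter(tuple(sorted(pair)) for pair in genotypes)
    mp = sum((c / n) ** 2 for c in geno_counts.values())

    return {
        "allele_frequencies": freqs,
        "observed_heterozygosity": h_obs,
        "expected_heterozygosity": h_exp,
        "matching_probability": mp,
        "power_of_discrimination": 1.0 - mp,
        "pic": pic,
        "effective_alleles": 1.0 / sum_sq,
    }


stats = forensic_parameters([("A", "A"), ("A", "B"), ("A", "B"), ("B", "B")])
```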

Optionally, after the step of determining the preliminary microhaplotypes from the SNP marker coordinates, the apparatus further includes:

a naming module, configured to name each of the preliminary microhaplotypes.

Further, an embodiment of the present application also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the program, implements the microhaplotype screening method described in the above embodiments.

Further, an embodiment of the present application also provides a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to cause a computer to execute the microhaplotype screening method described in the above embodiments.

The above describes preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make further improvements and refinements without departing from the principles of the present invention, and such improvements and refinements are also regarded as falling within the scope of protection of the present invention.

Claims (7)

1. A method for screening microhaplotypes, comprising:
acquiring data to be screened, and reading the marker coordinates of multiple rows of single nucleotide polymorphism markers in the data to be screened;
determining N preliminary microhaplotypes according to the marker coordinates of the multiple rows of single nucleotide polymorphism markers, wherein N is a positive integer greater than or equal to 1;
looking up a reference sequence corresponding to each preliminary microhaplotype, and calculating sequence feature parameters corresponding to each preliminary microhaplotype by using each reference sequence;
screening M target microhaplotypes from the N preliminary microhaplotypes according to the sequence feature parameters, wherein M is a positive integer greater than or equal to 1, and N is greater than or equal to M;
the sequence feature parameters comprise a GC content value, repeat sequence features and a genome-wide multiple-match index;
the calculating of the sequence feature parameters corresponding to each preliminary microhaplotype by using each reference sequence comprises:
searching preset whole-genome data for a plurality of similar sequences by BLAST analysis, taking each reference sequence as a template, and calculating evaluation parameters of each similar sequence, wherein the evaluation parameters comprise an expected value and a score;
counting, based on the expected value and the score, the number of similar sequences found, and taking that number as the genome-wide multiple-match index;
calculating the GC content value of each reference sequence;
extracting short tandem repeat sequence features from each reference sequence according to a preset repeat sequence feature value;
the data to be screened comprises genome data and transcriptome data;
the reading of the marker coordinates of the multiple rows of single nucleotide polymorphism markers in the data to be screened comprises:
when the data to be screened is genome data, reading the marker coordinates of multiple rows of single nucleotide polymorphism markers in the genome data;
when the data to be screened is transcriptome data, acquiring a start coordinate and an end coordinate of a chromosome contained in the transcriptome data, taking the span from the start coordinate to the end coordinate as a coordinate interval, and selecting the marker coordinates of a plurality of target single nucleotide polymorphism markers whose coordinate values fall within the coordinate interval;
the screening of M target microhaplotypes from the N preliminary microhaplotypes according to the sequence feature parameters comprises:
judging whether the GC content value corresponding to each preliminary microhaplotype meets a preset content value condition, judging whether the repeat sequence features corresponding to each preliminary microhaplotype meet a preset target sequence feature condition, and judging whether the genome-wide multiple-match index meets a preset index condition;
and selecting, from the N preliminary microhaplotypes, those preliminary microhaplotypes whose GC content value meets the preset content value condition, whose repeat sequence features meet the preset target sequence feature condition, and whose genome-wide multiple-match index meets the preset index condition, to obtain M target microhaplotypes.
2. The method for screening microhaplotypes according to claim 1, wherein the determining of N preliminary microhaplotypes according to the marker coordinates of the multiple rows of single nucleotide polymorphism markers comprises:
dividing the marker coordinates of the multiple rows of single nucleotide polymorphism markers into N marker coordinate sets according to a preset reference coordinate difference;
storing the single nucleotide polymorphism markers contained in each marker coordinate set in a preset Python dictionary;
and extracting from the preset Python dictionary, according to a preset storage quantity, the single nucleotide polymorphism markers contained in each marker coordinate set, and defining the single nucleotide polymorphism markers contained in each marker coordinate set as one preliminary microhaplotype, to obtain N preliminary microhaplotypes.
3. The method for screening microhaplotypes according to claim 1, wherein the looking up of the reference sequence corresponding to each preliminary microhaplotype comprises:
creating a sequence file according to the first single nucleotide polymorphism marker coordinate and the last single nucleotide polymorphism marker coordinate of each preliminary microhaplotype;
and inputting the sequence file into a preset sequence lookup tool to obtain the reference sequence corresponding to each preliminary microhaplotype.
4. The method for screening microhaplotypes according to claim 1, further comprising:
obtaining typing data, wherein the typing data comprises single nucleotide polymorphism marker genotyping data covering a number of populations;
splitting the typing data into a plurality of population typing data sets according to preset 1000 Genomes Project population origin and sample name, wherein each population typing data set comprises the single nucleotide polymorphism marker genotyping data corresponding to each sample.
5. The method for screening microhaplotypes according to claim 1, further comprising:
calculating corresponding forensic parameters by using the target microhaplotypes, wherein the forensic parameters comprise allele types and their frequencies, observed heterozygosity, expected heterozygosity, matching probability, polymorphism information content, individual identification probability, trio non-paternity exclusion probability, duo non-paternity exclusion probability, and effective number of alleles.
6. The method for screening microhaplotypes according to claim 1, wherein after the step of determining N preliminary microhaplotypes according to the marker coordinates of the multiple rows of single nucleotide polymorphism markers, the method further comprises:
naming each of the preliminary microhaplotypes.
7. A screening apparatus for microhaplotypes, the apparatus comprising:
a reading module, configured to acquire data to be screened and read the marker coordinates of multiple rows of single nucleotide polymorphism markers in the data to be screened;
a determination module, configured to determine N preliminary microhaplotypes according to the marker coordinates of the multiple rows of single nucleotide polymorphism markers, wherein N is a positive integer greater than or equal to 1;
a calculation module, configured to look up the reference sequence corresponding to each preliminary microhaplotype and calculate the sequence feature parameters corresponding to each preliminary microhaplotype by using each reference sequence;
a screening module, configured to screen M target microhaplotypes from the N preliminary microhaplotypes according to the sequence feature parameters, wherein M is a positive integer greater than or equal to 1, and N is greater than or equal to M;
the sequence feature parameters comprise a GC content value, repeat sequence features and a genome-wide multiple-match index;
the calculating of the sequence feature parameters corresponding to each preliminary microhaplotype by using each reference sequence comprises:
searching preset whole-genome data for a plurality of similar sequences by BLAST analysis, taking each reference sequence as a template, and calculating evaluation parameters of each similar sequence, wherein the evaluation parameters comprise an expected value and a score;
counting, based on the expected value and the score, the number of similar sequences found, and taking that number as the genome-wide multiple-match index;
calculating the GC content value of each reference sequence;
extracting short tandem repeat sequence features from each reference sequence according to a preset repeat sequence feature value;
the data to be screened comprises genome data and transcriptome data;
the reading of the marker coordinates of the multiple rows of single nucleotide polymorphism markers in the data to be screened comprises:
when the data to be screened is genome data, reading the marker coordinates of multiple rows of single nucleotide polymorphism markers in the genome data;
when the data to be screened is transcriptome data, acquiring a start coordinate and an end coordinate of a chromosome contained in the transcriptome data, taking the span from the start coordinate to the end coordinate as a coordinate interval, and selecting the marker coordinates of a plurality of target single nucleotide polymorphism markers whose coordinate values fall within the coordinate interval;
the screening of M target microhaplotypes from the N preliminary microhaplotypes according to the sequence feature parameters comprises:
judging whether the GC content value corresponding to each preliminary microhaplotype meets a preset content value condition, judging whether the repeat sequence features corresponding to each preliminary microhaplotype meet a preset target sequence feature condition, and judging whether the genome-wide multiple-match index meets a preset index condition;
and selecting, from the N preliminary microhaplotypes, those preliminary microhaplotypes whose GC content value meets the preset content value condition, whose repeat sequence features meet the preset target sequence feature condition, and whose genome-wide multiple-match index meets the preset index condition, to obtain M target microhaplotypes.
CN202110654476.7A 2021-06-11 2021-06-11 Screening method and device for micro haplotypes Active CN113284552B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110654476.7A CN113284552B (en) 2021-06-11 2021-06-11 Screening method and device for micro haplotypes

Publications (2)

Publication Number Publication Date
CN113284552A (en) 2021-08-20
CN113284552B (en) 2023-10-03

Family

ID=77284418

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110654476.7A Active CN113284552B (en) 2021-06-11 2021-06-11 Screening method and device for micro haplotypes

Country Status (1)

Country Link
CN (1) CN113284552B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107862177A (en) * 2017-07-12 2018-03-30 中国水产科学研究院淡水渔业研究中心 A kind of construction method for the SNP molecular labeling collection for distinguishing carp colony
CN109346130A (en) * 2018-10-24 2019-02-15 中国科学院水生生物研究所 A method for obtaining microhaplotypes and their typing directly from whole-genome resequencing data
CN112233724A (en) * 2020-10-16 2021-01-15 深圳市盛景基因生物科技有限公司 Ancestral polymorphism prediction method based on big data artificial intelligence algorithm

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6909971B2 (en) * 2001-06-08 2005-06-21 Licentia Oy Method for gene mapping from chromosome and phenotype data
US20200168299A1 (en) * 2017-07-28 2020-05-28 Pioneer Hi-Bred International, Inc. Systems and methods for targeted genome editing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
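Research progress of the genetic marker microhaplotype in forensic medicine; Chen Peng; Zhu Jing; Jiang Youjing; Chen Dan; Wang Hui; Mao Jiong; Liang Weibo; Zhang Lin; Chinese Journal of Forensic Medicine (05); pp. 54-57 *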

Also Published As

Publication number Publication date
CN113284552A (en) 2021-08-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant