CN111881324B

CN111881324B - High-throughput sequencing data general storage format structure, construction method and application thereof

Info

Publication number: CN111881324B
Application number: CN202010748559.8A
Authority: CN
Inventors: 郁春江; 沈百荣
Original assignee: Suzhou Industrial Park Institute of Services Outsourcing
Current assignee: Suzhou Industrial Park Institute of Services Outsourcing
Priority date: 2020-07-30
Filing date: 2020-07-30
Publication date: 2023-12-15
Anticipated expiration: 2040-07-30
Also published as: CN111881324A

Abstract

The invention provides a general storage format structure for high-throughput sequencing data, together with its construction method and application. With the invention, different types of high-throughput sequencing data can be stored in a single format, which overcomes the loss of data interoperability caused by the diversity of data formats. In addition, the format is structured, so filtering and extracting data is easier and faster than with unstructured text data.

Description

Common storage format structure of high-throughput sequencing data, its construction method and application

Technical Field

The invention belongs to the technical field of biological information processing and relates to a universal storage format structure for high-throughput sequencing data, its construction method, and its application.

Background

With the rapid development of high-throughput sequencing technology, differences in the instruments or vendors used, in sequencing principles, and in development backgrounds or goals (such as readability, integration, and space saving) have produced more and more types of sequencing data. A large number of analysis tools have been designed to analyze these data, but most of them define their own data storage formats (S. Pabinger, A. Dander, M. Fischer, R. Snajder, M. Sperk, M. Efremova, B. Krabichler, M. R. Speicher, J. Zschocke, and Z. Trajanoski, "A survey of tools for variant analysis of next-generation genome sequencing data," Brief Bioinform, vol. 15, no. 2, pp. 256-78, Mar 2014). For example, BAM/FASTQ/QSEQ, BAM/HDF5/FASTQ, and BAM/SFF/FASTQ are the file formats handled by Illumina, PacBio, and Ion Torrent sequencers, respectively. These factors have led to the diversity of data formats.

Data interoperability is a key link in big data analysis, and many format conversion tools have been developed. Their main function is to convert high-throughput sequencing data from one format to another (H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, and R. Durbin, "The Sequence Alignment/Map format and SAMtools," Bioinformatics, vol. 25, no. 16, pp. 2078-9, Aug 15, 2009; M. R. Breese and Y. Liu, "NGSUtils: a software suite for analyzing and manipulating next-generation sequencing datasets," Bioinformatics, vol. 29, no. 4, pp. 494-6, Feb 15, 2013).

However, most of these tools were developed for a specific and limited set of formats, and format conversion not only loses information but also consumes substantial computing resources. When no ready-made tool exists to convert a given format into the required one, users must either wait for someone else to develop it or write a program themselves, which is not easy for people who are not professional software developers.

Summary of the Invention

In view of the above technical problems, the purpose of the present invention is to provide a universal storage format structure for high-throughput sequencing data, its construction method, and its application. Different types of high-throughput sequencing data can all be stored in one format, which overcomes the loss of interoperability caused by the diversity of data formats. In addition, the universal format is structured, so data can be filtered and extracted more easily and quickly than with unstructured text data.

To achieve the above technical purpose, the present invention adopts the following technical solutions:

The present invention provides a universal storage format structure for high-throughput sequencing data comprising four components: a header component, a sequence component, a quality score component, and a sequence information component, wherein:

the header component is used to store the header description information of the file;

the sequence component is used to store sequence information, where the sequence information is either a base sequence or the path of a file that holds the base sequence;

the quality score component is used to store the quality scores of sequences, where a sequence quality score is either a quality score string or the path of a file that holds the quality scores;

the sequence information component is used to store the records and features of sequences.

Preferably, the universal storage format structure for high-throughput sequencing data is designed based on XML and XML Schema technology.

Preferably, the header component contains the sub-element meta_info, which has name and value attributes; the sequence component contains one or more seq sub-elements that represent sequences, and each seq sub-element has a unique identifier that is used in the sequence information component; the quality score component contains one or more qual sub-elements that represent sequence quality scores, and each qual sub-element has a unique identifier that is used in the sequence information component; the sequence information component contains one or more seqinfo sub-elements, and each seqinfo sub-element represents one sequence record.
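The four components described above can be pictured as a small in-memory data model. The following is a minimal sketch, not the patent's implementation: the class names (`MetaInfo`, `Stored`, `SeqInfo`, `NGSGFDocument`) and field names (`uid`, `content`, `path`, `seq_id`, `qual_id`) are hypothetical and chosen only for illustration. It shows the "inline content or file path" choice for sequences and quality scores, and records that point at them by identifier.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MetaInfo:
    """One header entry: a name/value pair, as in the meta_info sub-element."""
    name: str
    value: str


@dataclass
class Stored:
    """A seq or qual entry holding either inline content or a file path."""
    uid: str                       # unique identifier (hypothetical field name)
    content: Optional[str] = None  # inline bases or quality string
    path: Optional[str] = None     # path of a file holding the content

    def __post_init__(self) -> None:
        # Exactly one of the two storage modes must be used.
        if (self.content is None) == (self.path is None):
            raise ValueError("provide either inline content or a file path")


@dataclass
class SeqInfo:
    """One sequence record that points at a seq entry and, optionally, a qual entry."""
    seq_id: str
    qual_id: Optional[str] = None


@dataclass
class NGSGFDocument:
    header: List[MetaInfo] = field(default_factory=list)
    seqs: List[Stored] = field(default_factory=list)
    quals: List[Stored] = field(default_factory=list)
    seqinfos: List[SeqInfo] = field(default_factory=list)


doc = NGSGFDocument()
doc.seqs.append(Stored(uid="s1", content="ACGT"))            # small sequence stored inline
doc.seqs.append(Stored(uid="s2", path="/data/genome.fa"))    # large sequence kept in a file
doc.seqinfos.append(SeqInfo(seq_id="s1"))
```

Keeping the path variant in the same type as the inline variant means a record refers to both in the same way, by identifier.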

The present invention also provides an editing tool based on the above universal storage format structure, used to create and edit files in the universal storage format and to convert between NGS files and NGSGF files.

Preferably, the editing tool is written in Java using NetBeans IDE 10.0.

Preferably, the editing tool can be operated both through a GUI and from the command line.

Preferably, the editing tool supports conversion of five formats: FASTA, FASTQ, SAM, VCF, and CAF.

The present invention also provides a method for constructing the above universal storage format structure for high-throughput sequencing data, comprising the following steps:

1) Collect existing high-throughput data formats and classify them into five types: sequence and quality score formats, alignment formats, assembly formats, variant formats, and annotation and visualization formats;

2) Analyze the specification of each format to identify the content that is common to the formats and the content that is specific to each;

3) Design a universal storage format structure based on the common and format-specific content.

Preferably, the sequence and quality score formats include the Fasta/CSFASTA, Fastq/CSFASTQ, Qseq, SCARF, QUAL, 2bit/nib, and SFF formats; the alignment formats include the SAM, BAM, bowtie, and maq formats; the assembly formats include the ACE, AFG, and CAF formats; the variant formats include the GVF, pileup, and VCF formats; and the annotation and visualization formats include the BED, bigBED, Wig, bigWig, BedGraph, and GFF/GTF formats.
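The five-way classification can be written down as a simple lookup table. This is only an illustrative sketch: the category keys and the helper `category_of` are hypothetical names, and slash-separated pairs such as GFF/GTF are listed here as separate entries.

```python
from typing import Dict, List, Optional

# Format names as listed in the text, grouped into the five types.
FORMAT_CATEGORIES: Dict[str, List[str]] = {
    "sequence_and_quality": ["Fasta", "CSFASTA", "Fastq", "CSFASTQ", "Qseq",
                             "SCARF", "QUAL", "2bit", "nib", "SFF"],
    "alignment": ["SAM", "BAM", "bowtie", "maq"],
    "assembly": ["ACE", "AFG", "CAF"],
    "variant": ["GVF", "pileup", "VCF"],
    "annotation_and_visualization": ["BED", "bigBED", "Wig", "bigWig",
                                     "BedGraph", "GFF", "GTF"],
}

# Reverse index for case-insensitive lookup, e.g. "vcf" -> "variant".
_CATEGORY_BY_FORMAT: Dict[str, str] = {
    fmt.lower(): category
    for category, formats in FORMAT_CATEGORIES.items()
    for fmt in formats
}


def category_of(fmt: str) -> Optional[str]:
    """Return the category of a format name, or None if it is not listed."""
    return _CATEGORY_BY_FORMAT.get(fmt.lower())


print(category_of("vcf"))     # variant
print(category_of("BigWig"))  # annotation_and_visualization
```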

The present invention further provides the application of the above universal storage format structure in representing, storing, editing, and converting high-throughput sequencing data.

The format structure of the present invention is designed based on XML and XML Schema technology and can store the many different types of high-throughput sequencing data currently in use. The format structure specifies how the data are organized, while the specific content stored varies with the sequence information, so the format can store data from existing formats and can also accommodate newly emerging data formats.

The beneficial effects of the present invention are as follows:

First, the universal storage format structure of the present invention uses a component structure that divides the sequences and their descriptive information into four parts, which makes the format clear and self-describing and facilitates future extension.

Second, the universal storage format structure of the present invention introduces the idea of references, a technique widely used in computer science, into biological data formats. The format uses links: if their content is the same or similar, different sequence information records can refer to the same sequence or quality score, which avoids storing duplicate content.
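The reference idea can be sketched as content pooling: each distinct value is stored once under an identifier, and every record keeps only that identifier. The function name `pool_by_content` and the identifier scheme (`s1`, `s2`, ...) are assumptions made for this illustration, and the sketch handles only identical content, not the "similar" case mentioned in the text.

```python
from typing import Dict, List, Tuple


def pool_by_content(values: List[str], prefix: str) -> Tuple[Dict[str, str], List[str]]:
    """Store each distinct value once; return (id -> value, one reference per input)."""
    ids_by_value: Dict[str, str] = {}
    refs: List[str] = []
    for value in values:
        if value not in ids_by_value:
            # First time this content is seen: assign the next identifier.
            ids_by_value[value] = f"{prefix}{len(ids_by_value) + 1}"
        refs.append(ids_by_value[value])
    pool = {uid: value for value, uid in ids_by_value.items()}
    return pool, refs


reads = ["ACGT", "TTGA", "ACGT"]
pool, refs = pool_by_content(reads, "s")
print(pool)  # {'s1': 'ACGT', 's2': 'TTGA'}
print(refs)  # ['s1', 's2', 's1']
```

Three records are kept, but only two sequences are stored; the first and third records share `s1`.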

Third, the universal storage format structure of the present invention draws on the strengths of the currently popular NGS data formats and can store most biological sequence information. It also inherits the flexibility and extensibility of XML. Because NGS technology is developing rapidly, new concepts and analysis tools keep emerging, and older data formats struggle to meet current needs. The extensibility of the universal storage format structure overcomes the limitations of format-specific designs, and its flexibility allows it to adapt to future needs.

Finally, the universal storage format structure of the present invention is highly readable: it can be processed easily by computer programs, and it is also easier for people to read and to understand what is stored. This advantage can be attributed to the tree structure of XML.

Brief Description of the Drawings

Figure 1 shows 26 high-throughput data storage formats commonly used in the prior art.

Figure 2 shows the overall technical architecture of the present invention.

Figure 3 shows the overall format structure of the NGSGF of the present invention.

Figure 4 shows the format structure of the NGSGF header component of the present invention.

Figure 5 shows the format structure of the NGSGF sequence component of the present invention.

Figure 6 shows the format structure of the NGSGF quality score component of the present invention.

Figure 7 shows the format structure of the NGSGF sequence information component of the present invention.

Figure 8 shows screenshots of the NGSGFEditor user interface in Example 2 of the present invention.

Figure 9 shows a screenshot of the two NGSGFEditor projects in Example 2 of the present invention.

Figure 10 shows a screenshot of "Step 1: Create a new NGSGF file" in Example 2 of the present invention.

Figure 11 shows screenshots of "Step 2: Add sequences" in Example 2 of the present invention.

Figure 12 shows screenshots of "Step 3: Add quality scores" in Example 2 of the present invention.

Figure 13 shows screenshots of "Step 4: Add sequence information" in Example 2 of the present invention.

Figure 14 shows a screenshot of "Step 5: Save the NGSGF file" in Example 2 of the present invention.

Figure 15 shows screenshots of converting between FASTQ and NGSGF files through the NGSGFEditor GUI in Example 3 of the present invention.

Figure 16 shows a screenshot of the help displayed by entering "java -jar NGSGFEditor.jar -h" in Example 3 of the present invention.

Figure 17 shows screenshots of converting between SAM and NGSGF files using the NGSGFEditor command line in Example 3 of the present invention.

Detailed Description

To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the embodiments and the accompanying drawings.

To address the current compatibility problem of NGS data, we developed a new XML-based universal storage format, hereinafter referred to as NGSGF, which can accommodate most NGS data types. NGSGF is based on the Extensible Markup Language (XML), which is widely used for data storage on the Internet and in fields such as mathematics and biology. NGSGF is used to describe the data generated by NGS technology, and the different types of information used in NGS, such as alignment, assembly, and annotation information, are integrated into it. Because XML is highly extensible, NGSGF is easy to extend with new features.

The present invention first surveyed the data storage formats currently used in the field of high-throughput sequencing data. A total of 26 commonly used high-throughput data formats were collected and divided into five types: sequence and quality score formats (Sequence or quality score), alignment formats (Alignment), assembly formats (Assembly), variant formats (Variant), and sequence annotation and visualization formats (Sequence annotation & visualization). The sequence and quality score formats may include the Fasta/CSFASTA, Fastq/CSFASTQ, Qseq, SCARF, QUAL, 2bit/nib, and SFF formats; the alignment formats may include the SAM, BAM, bowtie, and maq formats; the assembly formats may include the ACE, AFG, and CAF formats; the variant formats may include the GVF, pileup, and VCF formats; and the annotation and visualization formats may include the BED, bigBED, Wig, bigWig, BedGraph, and GFF/GTF formats, as shown in Figure 1.

The specification of each format was then analyzed, focusing on the content the format stores and how that content is organized. Once the specification of each format was understood, the content common to the formats and the content specific to each were identified, and a universal storage format was designed on that basis. As shown in Figure 3, the newly proposed format NGSGF contains four components: the header component (header_lines), the sequence component (list_of_seqs), the quality score component (list_of_quals), and the sequence information component (list_of_seqinfos).

The header component stores header description information; most existing NGS file formats contain header information that describes the stored content. As shown in Figure 4, header_lines contains the sub-element meta_info, whose name and value attributes store the NGS header description information.

The sequence component stores sequence information, which is either a base sequence or the path of a file that holds the base sequence. Storing a file path allows NGSGF to handle large sequence files. As shown in Figure 5, the list_of_seqs component contains one or more seq sub-elements that represent sequences. Each seq sub-element has a unique identifier, which is used in the list_of_seqinfos component.

The quality score component stores sequence quality scores, each of which is either a quality score string or the path of a file that holds the quality scores. Storing a file path allows NGSGF to handle large quality score files. As shown in Figure 6, the list_of_quals component contains one or more qual sub-elements that represent sequence quality scores. Each qual sub-element has a unique identifier, which is used in the list_of_seqinfos component.

The sequence information component stores sequence records and features. As shown in Figure 7, the list_of_seqinfos component contains one or more seqinfo sub-elements, and each seqinfo sub-element represents one sequence record. Typically, NGSGF sequence records are stored in this component.

This universal storage format can store the content stored in the 26 formats above.
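As a rough illustration of the four components, the sketch below builds a tiny NGSGF-like tree with Python's standard `xml.etree.ElementTree`. It is an assumption-laden sketch rather than the patent's schema: the root name `ngs` and the names `nid`, `origin`, `seqref`, and `qualref` are taken from the editor walkthrough in Example 2, but whether each is an XML attribute or a child element is not stated in the text, and the `meta_info` pair is made up for the example.

```python
import xml.etree.ElementTree as ET


def build_ngsgf_skeleton() -> ET.Element:
    """Build a minimal NGSGF-like tree with one sequence record."""
    root = ET.Element("ngs")

    # Header component: name/value pairs (this particular pair is illustrative).
    header = ET.SubElement(root, "header_lines")
    ET.SubElement(header, "meta_info", {"name": "source_format", "value": "FASTQ"})

    # Sequence component: each seq carries a unique identifier.
    seqs = ET.SubElement(root, "list_of_seqs")
    seq = ET.SubElement(seqs, "seq", {"nid": "s1"})
    ET.SubElement(seq, "origin").text = "CCCTTCTTGTCTTCAGCGTTTCTCC"

    # Quality score component: each qual carries a unique identifier.
    quals = ET.SubElement(root, "list_of_quals")
    qual = ET.SubElement(quals, "qual", {"nid": "q1"})
    ET.SubElement(qual, "origin").text = ";;3;;;;;;;;;;;;7;;;;;;;88"

    # Sequence information component: one record referring to s1 and q1.
    infos = ET.SubElement(root, "list_of_seqinfos")
    info = ET.SubElement(infos, "seqinfo")
    ET.SubElement(info, "seq", {"seqref": "s1"})
    ET.SubElement(info, "qual", {"qualref": "q1"})
    return root


root = build_ngsgf_skeleton()
print(ET.tostring(root, encoding="unicode"))
```

The record in `list_of_seqinfos` holds no bases or scores itself; it only names the `seq` and `qual` entries it refers to.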

Based on the designed structure, the present invention also developed corresponding editing and conversion software (shown in the figure as NGSGFEditor and NGSGF Format Converter). The software can edit high-throughput data files that use the format structure, and it can also convert in both directions between existing text-based high-throughput data formats and the designed XML-based universal format, as shown in Figure 2.

Example 1

The NGSGF format designed in the present invention can store data in the FASTA, FASTQ, SAM, CAF, and VCF formats commonly used in high-throughput sequencing.

1. Storing sequence-format FASTA data in NGSGF format

FASTA data:

    >KM081703.1 Abbottina rivularis mitochondrion, complete genome
    GCTAGTGTAGCTTAATCCAAAGCATAACACTGAAGATGTTAAGATGAGCCCTAAGAAGCTCCGCATGCAC
    >AF511507.1 Alligator sinensis mitochondrion, complete genome
    CAATAAAGACTTAGTCCCGGTCTTCTTATTAACTACCACTTAACCTATACATGCAAGCATCCACGAACCA

Corresponding NGSGF data:

2. Storing sequence and quality score format FASTQ data in NGSGF format

FASTQ data:

    @EAS54_6_R1_2_1_413_324
    CCCTTCTTGTCTTCAGCGTTTCTCC
    +
    ;;3;;;;;;;;;;;;7;;;;;;;88
    @EAS54_6_R1_2_1_540_792
    TTGGCAGGCCAAGGCCGATGGATCA
    +
    ;;;;;;;;;;;7;;;;;-;;;3;83

Corresponding NGSGF data:
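As a hedged illustration (not the patent's own listing), FASTQ records like those above could be mapped onto the sequence, quality score, and sequence information components roughly as follows. The element and attribute layout is an assumption based on the names used in Example 2; in particular, storing the read name in a `name` attribute on `seqinfo` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, List, Tuple


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_name, bases, qualities) from four-line FASTQ records."""
    for i in range(0, len(lines) - 3, 4):
        name, bases, plus, quals = (line.rstrip("\n") for line in lines[i:i + 4])
        if not name.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        yield name[1:], bases, quals


def fastq_to_ngsgf(lines: List[str]) -> ET.Element:
    """Build an NGSGF-like tree; ids s1/q1, s2/q2, ... are generated per record."""
    root = ET.Element("ngs")
    seqs = ET.SubElement(root, "list_of_seqs")
    quals = ET.SubElement(root, "list_of_quals")
    infos = ET.SubElement(root, "list_of_seqinfos")
    for n, (name, bases, qualities) in enumerate(parse_fastq(lines), start=1):
        seq = ET.SubElement(seqs, "seq", {"nid": f"s{n}"})
        ET.SubElement(seq, "origin").text = bases
        qual = ET.SubElement(quals, "qual", {"nid": f"q{n}"})
        ET.SubElement(qual, "origin").text = qualities
        # Hypothetical: the read name is kept as an attribute of the record.
        info = ET.SubElement(infos, "seqinfo", {"name": name})
        ET.SubElement(info, "seq", {"seqref": f"s{n}"})
        ET.SubElement(info, "qual", {"qualref": f"q{n}"})
    return root


fastq_lines = [
    "@EAS54_6_R1_2_1_413_324", "CCCTTCTTGTCTTCAGCGTTTCTCC", "+", ";;3;;;;;;;;;;;;7;;;;;;;88",
    "@EAS54_6_R1_2_1_540_792", "TTGGCAGGCCAAGGCCGATGGATCA", "+", ";;;;;;;;;;;7;;;;;-;;;3;83",
]
root = fastq_to_ngsgf(fastq_lines)
print(ET.tostring(root, encoding="unicode"))
```

Quality strings may contain characters such as `<` or `&`; `ElementTree` escapes them when the tree is serialized, so they can be stored as element text.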

3. Storing sequence alignment format SAM data in NGSGF format

SAM data:

Corresponding NGSGF data:

4. Storing sequence assembly format CAF data in NGSGF format

CAF data:

    DNA: 22ak93c2.r1t
    GTCGCnCATAAGATTACGAGATCTCGAGCTCGGTACCCTTCAAGCGATTCTCCTGCCTCA

    BaseQuality: 22ak93c2.r1t
    4 4 8 4 4 4 4 4 4 4 4 4 6 8 17 21 14 7 6 6 6 7 7 6 8 14 16 21 15 20 20
    24 26 21 18 18 14 14 19 23 10 8 8 15 20 16 29 26 34 29 39 29 31 29 31

    Sequence: 22ak93c2.r1t
    Is_read
    Padded
    Staden_id 11
    Clipping QUAL 39 331
    Align_to_SCF 1 43 1 43
    Align_to_SCF 44 317 45 318
    Align_to_SCF 319 716 319 716
    SCF_File 22ak93c2.r1tSCF
    Primer Universal_primer
    Strand Reverse
    Dye Dye_terminator
    Template 22ak93c2
    Clone bK216E10
    Sequencing_vector "m13mp18"
    Seq_vec SVEC 1 38 "M13mp18"
    Tag ALUS 43 180
    Tag DONE 43 43 "AUTO-EDIT: deleted C at 43(terminator, isolated, strong)"
    Tag DONE 254 254 "AUTO-EDIT: replaced T by g at 254(terminator, isolated, strong)"
    Tag ALUS 269 402
    Tag DONE 283 283 "AUTO-EDIT: replaced T by g at 283(terminated, isolated, strong)"
    Tag AMBG 298 302 "AUTOEDIT: Check this edit cluster!"
    Tag DONE 317 317 "AUTO-EDIT: replaced C by a at 317(terminated, compound, strong)"
    Tag DONE 318 318 "AUTO-EDIT: inserted g at 318(terminated, compound, strong)"

Corresponding NGSGF data:

5. Storing sequence variant format VCF data in NGSGF format

VCF data:

    ##fileformat=VCFv4.2
    ##fileDate=20090805
    ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
    ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
    ##FILTER=<ID=q10,Description="Quality below 10">
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
    20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.

Corresponding NGSGF data:
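As a hedged illustration (not the patent's own listing), the `##key=value` meta lines of a VCF file fit naturally into the name and value attributes of meta_info in the header component. The function name below is hypothetical, and how NGSGF actually represents the `#CHROM` column line and the variant records is not covered by this sketch.

```python
import xml.etree.ElementTree as ET
from typing import Iterable


def vcf_meta_to_header_lines(lines: Iterable[str]) -> ET.Element:
    """Turn VCF '##name=value' meta lines into meta_info elements."""
    header = ET.Element("header_lines")
    for line in lines:
        if not line.startswith("##"):
            continue  # the '#CHROM ...' column line and data lines are skipped here
        # Split on the first '=' only; structured values keep their inner '='.
        name, sep, value = line[2:].rstrip("\n").partition("=")
        if sep:
            ET.SubElement(header, "meta_info", {"name": name, "value": value})
    return header


vcf_lines = [
    "##fileformat=VCFv4.2",
    "##fileDate=20090805",
    '##FILTER=<ID=q10,Description="Quality below 10">',
    "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003",
]
header = vcf_meta_to_header_lines(vcf_lines)
print(ET.tostring(header, encoding="unicode"))
```

Characters such as `<`, `>`, and `"` inside a value are escaped automatically when the attribute is serialized.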

Example 2: Creating an NGSGF file

NGSGFEditor is designed for creating and editing NGSGF files. It has a user-friendly GUI and can also be run from the command line, and it is intended to help users work with NGSGF files. Figure 8 shows the NGSGFEditor user interface while starting the software (A: Start NGSGFEditor), converting a format (B: Convert SAM file), opening a file (C: Open NGSGF file), and editing a file (D: Edit NGSGF file).

The NGSGFEditor in this example is written in Java using NetBeans IDE 10.0 and consists of two projects, as shown in Figure 9.

Here we use NGSGFEditor to create an NGSGF file that holds FASTQ-format data. The FASTQ content is:

    @EAS54_6_R1_2_1_413_324
    CCCTTCTTGTCTTCAGCGTTTCTCC
    +
    ;;3;;;;;;;;;;;;7;;;;;;;88
    @EAS54_6_R1_2_1_540_792
    TTGGCAGGCCAAGGCCGATGGATCA
    +
    ;;;;;;;;;;;7;;;;;-;;;3;83

Step 1: Create a new NGSGF file

Click the "new" button to create a new NGSGF file, as shown in Figure 10.

Step 2: Add sequences

(1) Right-click the "ngs" node to display the pop-up menu, then click the "list_of_seqs" menu item to create a new node, as shown in Figure 11(1).

(2) Right-click the "list_of_seqs" node to add a "seq" child node, as shown in Figure 11(2).

(3) Right-click the "seq" node to add the "nid" attribute, as shown in Figure 11(3).

(4) Right-click the "nid" node and select the "Edit" menu item to edit the node value, as shown in Figure 11(4).

(5) Add an "origin" node in the same way as the "nid" node. Right-click the "origin" node, select the "Edit" menu item, and enter the sequence, as shown in Figure 11(5).

Step 3: Add quality scores

(1) Right-click the "ngs" node to add the "list_of_quals" node, as shown in Figure 12(1).

(2) Add the "nid" and "origin" nodes. Right-click the "origin" node to enter the quality score, as shown in Figure 12(2).

Step 4: Add sequence information

(1) Right-click the "ngs" node to add the "list_of_seqinfos" node, as shown in Figure 13(1).

(2) Right-click the "list_of_seqinfos" node to add a "seqinfo" node, as shown in Figure 13(2).

(3) Right-click the "seqinfo" node to add a "seq" node, as shown in Figure 13(3).

(4) Right-click the "seq" node to add the "seqref" attribute, as shown in Figure 13(4).

(5) Right-click the "seqinfo" node to add a "qual" node, as shown in Figure 13(5).

(6) Add the "seqref" and "qualref" nodes to the "seq" and "qual" nodes. Right-click the "seqref" and "qualref" nodes and enter the reference values. In this example, the sequence of the first record is "s1" and its quality score is "q1", as shown in Figure 13(6).

Add the second record of the FASTQ file in the same way as the first.

Step 5: Save the NGSGF file

Finally, the sequences are saved in the "list_of_seqs" node, the quality scores in the "list_of_quals" node, and the FASTQ records in the "list_of_seqinfos" node, as shown in Figure 14.

Example 3: Converting between NGS files and NGSGF files

Users can also use NGSGFEditor to convert between NGS files and NGSGF files.

Currently, NGSGFEditor supports five formats: FASTA, FASTQ, SAM, VCF, and CAF.

NGSGFEditor can be run on Windows and Linux systems.

Format conversion can be invoked through the GUI or from the command line.

1. Using the NGSGFEditor GUI

1.1 Convert FASTQ to NGSGF

(1) Use the "Add" button to add the FASTQ files.

(2) Use the "Browse" button to select a folder as the output directory.

(3) Click the "Start" button.

See Figure 15(1).

1.2 Convert NGSGF to FASTQ

(1) Use the "Add" button to add the NGSGF files.

(2) Select NGSGF as the input format and FASTQ as the output format.

(3) Use the "Browse" button to select a folder as the output directory.

(4) Click the "Start" button.

See Figure 15(2).
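To give a sense of what an NGSGF-to-FASTQ conversion involves, the sketch below resolves the `seqref` and `qualref` references of each record and emits four FASTQ lines per record. It is a simplified assumption, not NGSGFEditor's actual code: the XML layout follows the names used in Example 2, the `name` attribute on `seqinfo` is hypothetical, and file-path entries are not handled.

```python
import xml.etree.ElementTree as ET
from typing import List


def ngsgf_to_fastq(root: ET.Element) -> List[str]:
    """Resolve each record's references and return FASTQ lines."""
    # Index the stored sequences and quality strings by their identifiers.
    seqs = {s.get("nid"): s.findtext("origin", default="")
            for s in root.iterfind("list_of_seqs/seq")}
    quals = {q.get("nid"): q.findtext("origin", default="")
             for q in root.iterfind("list_of_quals/qual")}

    lines: List[str] = []
    for info in root.iterfind("list_of_seqinfos/seqinfo"):
        seq_ref = info.find("seq").get("seqref")
        qual_ref = info.find("qual").get("qualref")
        lines += [f"@{info.get('name', '')}", seqs[seq_ref], "+", quals[qual_ref]]
    return lines


document = """<ngs>
  <list_of_seqs><seq nid="s1"><origin>CCCTTCTTGTCTTCAGCGTTTCTCC</origin></seq></list_of_seqs>
  <list_of_quals><qual nid="q1"><origin>;;3;;;;;;;;;;;;7;;;;;;;88</origin></qual></list_of_quals>
  <list_of_seqinfos>
    <seqinfo name="EAS54_6_R1_2_1_413_324"><seq seqref="s1"/><qual qualref="q1"/></seqinfo>
  </list_of_seqinfos>
</ngs>"""
fastq = ngsgf_to_fastq(ET.fromstring(document))
print("\n".join(fastq))
```

Because records hold only references, two records that point at the same `seq` entry would be written out as two separate FASTQ reads with identical bases.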

2. Using the NGSGFEditor command line

This example was carried out on a Linux system.

Enter "java -jar NGSGFEditor.jar -h" to display the help, as shown in Figure 16.

2.1 Convert SAM to NGSGF

Enter "java -jar NGSGFEditor.jar -c sam2ngsgf -i input_path -o output_path" to convert SAM to NGSGF, as shown in Figure 17(1).

2.2 Convert NGSGF to SAM

Enter "java -jar NGSGFEditor.jar -c ngsgf2sam -i inputpath -o outputpath" to convert NGSGF to SAM, as shown in Figure 17(2).
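For batch use, the command lines above can be assembled programmatically. The sketch below only builds and prints the argument list; the helper name `ngsgf_editor_command` and the example paths are hypothetical, and the flags `-c`, `-i`, and `-o` are taken from the commands quoted in this example rather than from separate tool documentation.

```python
import shlex
from typing import List


def ngsgf_editor_command(conversion: str, input_path: str, output_path: str,
                         jar: str = "NGSGFEditor.jar") -> List[str]:
    """Build the argument list for one conversion, e.g. 'sam2ngsgf' or 'ngsgf2sam'."""
    return ["java", "-jar", jar, "-c", conversion, "-i", input_path, "-o", output_path]


cmd = ngsgf_editor_command("sam2ngsgf", "reads.sam", "out_dir")
print(shlex.join(cmd))  # java -jar NGSGFEditor.jar -c sam2ngsgf -i reads.sam -o out_dir

# To actually run the conversion (requires Java and the jar to be available):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids quoting problems when paths contain spaces.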

It should be understood that all other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention, without creative effort, fall within the scope of protection of the present invention.

Claims

1. A method for constructing a universal storage format structure for high-throughput sequencing data, wherein the universal storage format structure is designed based on XML and XML Schema technology, the method comprising the following steps:

1) Collect existing high-throughput data formats and classify them into five types: sequence and quality score formats, alignment formats, assembly formats, mutation formats, annotation and visualization formats;

2) Analyze the specific specifications of each format to find commonalities and characteristics;

3) Design a universal storage format based on common and characteristic content. The universal storage format structure of high-throughput sequencing data includes four components: header component, sequence component, quality score component, and sequence information component, where:

The header component is used to store the header description information of the file. The header component contains the sub-element meta_info, which contains name and value attributes;

The sequence component is used to store sequence information. The sequence information is a base sequence or a file path for storing the base sequence. The sequence component contains one or more seq sub-elements to represent the sequence, and each seq sub-element has a unique identifier used to locate the sequence information component;

The quality score component is used to store the sequence quality score. The sequence quality score is a quality score string or a file path for storing the quality score. The quality score component contains one or more qual sub-elements to represent the sequence quality score, and each qual sub-element has a unique identifier, which is used to locate the sequence information component;

The sequence information component is used to store the records and characteristics of the sequence. The sequence information component contains one or more seqinfo sub-elements, and one seqinfo sub-element represents a sequence record.

2. The method for constructing a universal storage format structure of high-throughput sequencing data according to claim 1, characterized in that the sequence and quality score formats include the Fasta/CSFASTA, Fastq/CSFASTQ, Qseq, SCARF, QUAL, 2bit/nib, and SFF formats; the alignment formats include the SAM, BAM, bowtie, and maq formats; the assembly formats include the ACE, AFG, and CAF formats; the mutation formats include the GVF, pileup, and VCF formats; and the annotation and visualization formats include the BED, bigBED, Wig, bigWig, BedGraph, and GFF/GTF formats.

3. Application of the universal storage format structure of high-throughput sequencing data obtained by the construction method of claim 1 or 2 in representing, storing, editing and converting high-throughput sequencing data.