CN111881324B - High-throughput sequencing data general storage format structure, construction method and application thereof - Google Patents
- Publication number
- CN111881324B (application CN202010748559.8A)
- Authority
- CN
- China
- Prior art keywords
- sequence
- format
- formats
- component
- quality score
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/80—Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
- G06F16/81—Indexing, e.g. XML tags; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/80—Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
- G06F16/84—Mapping; Conversion
- G06F16/86—Mapping to a database
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Abstract
The invention provides a general storage format structure for high-throughput sequencing data, together with a construction method and applications thereof. With the invention, different types of high-throughput sequencing data can be stored in a single format, which overcomes the loss of data interoperability caused by the diversity of data formats. In addition, the general format is structured, so data can be filtered and extracted more easily and quickly than from unstructured text data.
Description
Technical Field
The invention belongs to the technical field of biological information processing, and relates to a general storage format structure of high-throughput sequencing data, a construction method and application thereof.
Background
With the rapid development of high-throughput sequencing technologies, differences in sequencing instruments and vendors, in sequencing principles, and in development context or goals (such as readability, integration and space saving) have produced an increasing variety of sequencing data. Many analysis programs have been designed to analyze these data, but most of them define their own data storage formats (S. Pabinger, A. Dander, M. Fischer, R. Snajder, M. Sperk, M. Efremova, B. Krabichler, M. R. Speicher, J. Zschocke, and Z. Trajanoski, "A survey of tools for variant analysis of next-generation genome sequencing data," Brief Bioinform, vol. 15, no. 2, pp. 256-78, Mar. 2014). For example, BAM/FASTQ/QSEQ, BAM/HDF5/FASTQ and BAM/SFF/FASTQ are the file formats handled by Illumina, PacBio and Ion Torrent sequencers, respectively. As a result, a wide variety of data formats is in use.
Data interoperability is a key element of big data analysis, and many format conversion tools have been developed whose main function is to convert high-throughput sequencing data from one format to another (H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, and R. Durbin, "The Sequence Alignment/Map format and SAMtools," Bioinformatics, vol. 25, no. 16, pp. 2078-9, Aug. 15, 2009; M. R. Breese and Y. Liu, "NGSUtils: a software suite for analyzing and manipulating next-generation sequencing datasets," Bioinformatics, vol. 29, no. 4, pp. 494-6, Feb. 15, 2013).
However, most of these tools were developed for a specific and limited set of formats, and format conversion not only loses information but also requires substantial computational resources. When no ready-made tool exists to convert a given format into the desired one, users who are not professional developers must either wait for someone else to develop such a tool or write one themselves, neither of which is easy.
Disclosure of Invention
In view of the above technical problems, the invention aims to provide a general storage format structure for high-throughput sequencing data, a construction method and applications thereof. Different types of high-throughput sequencing data can be stored in one format, which overcomes the loss of data interoperability caused by the diversity of data formats. In addition, the general format is structured, so data can be filtered and extracted more easily and quickly than from unstructured text data.
To achieve this technical purpose, the invention adopts the following technical solution:
the invention provides a general storage format structure for high-throughput sequencing data, which comprises four components: a header component, a sequence component, a quality score component and a sequence information component, wherein:
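To illustrate why a structured format is easier to filter than free text, the following is a minimal Python sketch that selects sequences above a length threshold from an NGSGF-like document. The element names (ngs, list_of_seqs, seq, list_of_seqinfos, seqinfo) and the nid and seqref identifiers are taken from the detailed description and Example 2; storing the bases in an origin attribute, the sample data and the function name are assumptions made only for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the attribute layout is an assumption.
SAMPLE = (
    '<ngs>'
    '<list_of_seqs>'
    '<seq nid="s1" origin="ACGTACGT"/>'
    '<seq nid="s2" origin="ACG"/>'
    '</list_of_seqs>'
    '<list_of_seqinfos>'
    '<seqinfo><seq seqref="s1"/></seqinfo>'
    '<seqinfo><seq seqref="s2"/></seqinfo>'
    '</list_of_seqinfos>'
    '</ngs>'
)


def sequences_longer_than(root, min_len):
    """Return the bases of every record whose sequence is longer than min_len."""
    # Index the sequence component by identifier once.
    seq_by_id = {s.get("nid"): s.get("origin", "")
                 for s in root.findall("list_of_seqs/seq")}
    hits = []
    # Walk the records and resolve each reference through the index.
    for ref in root.findall("list_of_seqinfos/seqinfo/seq"):
        bases = seq_by_id.get(ref.get("seqref"), "")
        if len(bases) > min_len:
            hits.append(bases)
    return hits


long_reads = sequences_longer_than(ET.fromstring(SAMPLE), 5)
```

The same query against a text format would need a format-specific parser, whereas here a generic XML library and two path expressions are enough.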
the header component is used for storing header description information of the file;
the sequence component is used for storing sequence information, wherein the sequence information is a base sequence or a file path for storing the base sequence;
the quality score component is used for storing the quality score of the sequence, and the quality score of the sequence is a quality score character string or a file path for storing the quality score;
the sequence information component is used for storing records and features of the sequence.
Preferably, the high-throughput sequencing data universal storage format structure is designed based on XML and XML Schema technology.
Preferably, the header component contains a sub-element meta_info, which has a name attribute and a value attribute; the sequence component comprises one or more seq sub-elements, each representing a sequence and each carrying a unique identifier that is referenced from the sequence information component; the quality score component comprises one or more quality sub-elements, each representing a sequence quality score and each carrying a unique identifier that is referenced from the sequence information component; the sequence information component contains one or more seqinfo sub-elements, each of which represents one sequence record.
The invention also provides an editing tool based on the high-throughput sequencing data general storage format structure, which is used for creating and editing files in the general storage format (NGSGF files) and for converting between NGS files and NGSGF files.
Preferably, the editing tool is written in Java using NetBeans IDE 10.0.
Preferably, the editing tool can be operated through a GUI or from the command line.
Preferably, the formats supported for conversion by the editing tool include FASTA, FASTQ, SAM, VCF and CAF.
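The four-component layout described above can be sketched with Python's standard xml.etree.ElementTree module. The element names (ngs, header_lines, meta_info, list_of_seqs, list_of_quals, list_of_seqinfos, seqinfo, seqref, qualref) follow the detailed description and Example 2; the qual element name, the use of an origin attribute for content and the sample values are assumptions, not the published schema.

```python
import xml.etree.ElementTree as ET


def build_ngsgf_skeleton():
    """Build a one-record NGSGF-like tree with the four components."""
    root = ET.Element("ngs")

    # Header component: name/value pairs describing the file.
    header = ET.SubElement(root, "header_lines")
    ET.SubElement(header, "meta_info", name="source_format", value="FASTQ")

    # Sequence component: each seq carries a unique identifier (nid).
    seqs = ET.SubElement(root, "list_of_seqs")
    ET.SubElement(seqs, "seq", nid="s1", origin="CCCTTCTTGTCTTCAGCGTTTCTCC")

    # Quality score component: each entry also carries a unique identifier.
    quals = ET.SubElement(root, "list_of_quals")
    ET.SubElement(quals, "qual", nid="q1", origin=";;3;;;;;;;;;;;;7;;;;;;;88")

    # Sequence information component: a record points at s1 and q1.
    infos = ET.SubElement(root, "list_of_seqinfos")
    record = ET.SubElement(infos, "seqinfo")
    ET.SubElement(record, "seq", seqref="s1")
    ET.SubElement(record, "qual", qualref="q1")
    return root


skeleton = build_ngsgf_skeleton()
xml_text = ET.tostring(skeleton, encoding="unicode")
```

Keeping the payload in the sequence and quality components and only identifiers in the record is what lets several records share one stored sequence.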
The invention also provides a construction method of the high-throughput sequencing data general storage format structure, which comprises the following steps:
1) Existing high-throughput data formats are collected and classified into five types: sequence and quality score formats, alignment formats, assembly formats, mutation formats, annotation and visualization formats;
2) The specification of each format is analyzed to identify the content that the formats have in common and the content that is specific to each;
3) The general storage format structure is designed based on this common and format-specific content.
Preferably, the sequence and quality score formats include the Fasta/CSFASTA, fastq/CSFASTQ, qseq, SCARF, QUAL, 2bit/nib and SFF formats; the alignment formats include the SAM, BAM, bowtie and maq formats; the assembly formats include the ACE, AFG and CAF formats; the mutation formats include the GVF, pileup and VCF formats; the annotation and visualization formats include the BED, bigBED, wig, bigWig, bedGraph and GFF/GTF formats.
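The five-way classification can be captured as a simple lookup table. The sketch below is only an illustration in Python: the type labels and format names come from the text above, slash-separated pairs are listed as separate entries, and the function name classify_format is hypothetical.

```python
# Format names as listed in the text, grouped by the five types.
FORMAT_TYPES = {
    "sequence and quality score": [
        "Fasta", "CSFASTA", "fastq", "CSFASTQ", "qseq",
        "SCARF", "QUAL", "2bit", "nib", "SFF",
    ],
    "alignment": ["SAM", "BAM", "bowtie", "maq"],
    "assembly": ["ACE", "AFG", "CAF"],
    "mutation": ["GVF", "pileup", "VCF"],
    "annotation and visualization": [
        "BED", "bigBED", "wig", "bigWig", "bedGraph", "GFF", "GTF",
    ],
}


def classify_format(name):
    """Return the type of a format name (case-insensitive), or None if unknown."""
    wanted = name.lower()
    for type_name, formats in FORMAT_TYPES.items():
        if wanted in (fmt.lower() for fmt in formats):
            return type_name
    return None
```

For example, classify_format("BAM") yields "alignment", while an unlisted name yields None.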
The invention also provides application of the general storage format structure of the high-throughput sequencing data in representing, storing, editing and converting the high-throughput sequencing data.
The invention designs a format structure based on XML and XML Schema technology that can store many of the different types of high-throughput sequencing data currently in use. The format structure prescribes how data are stored, while the specific stored content varies with the sequence information, so the format can not only store data held in existing formats but also accommodate newly emerging data formats.
The beneficial effects of the invention are as follows:
First, the general storage format structure of the invention uses a component structure that divides the sequences and their description information into four parts, so the format is clearly structured, self-describing and convenient to extend in the future.
Second, the general storage format structure introduces the idea of references, a technique widely used in computer science, into a biological data format. In this format, different sequence information records can refer, in the form of links, to the same sequence or quality score when the content is the same or similar, which avoids storing duplicate content.
Third, the general storage format structure makes full use of the advantages of the currently popular NGS data formats and can store most biological sequence information. In addition, it inherits the flexibility and extensibility of XML. With the rapid development of NGS technology, new concepts and analysis tools keep emerging, and old data formats struggle to meet current requirements. The extensibility of the general storage format structure overcomes this limitation of format-specific designs, and its flexibility allows it to adapt to future needs.
Finally, the general storage format structure is highly readable: it can be easily processed by a computer program, and it is also easier for humans to read and understand the stored content. This advantage can be attributed to the tree structure of XML.
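The reference idea described in the second point can be sketched as follows: each distinct sequence is stored once under an identifier, and every record keeps only that identifier. This Python sketch handles identical content only (sharing of merely similar content is not attempted), and the function name and the s1, s2 numbering scheme are assumptions for illustration.

```python
def intern_sequences(records):
    """Store each distinct sequence once and return per-record references.

    records: iterable of (record_name, bases) pairs.
    Returns (seq_table, refs), where seq_table maps identifier -> bases
    and refs is a list of (record_name, identifier) pairs.
    """
    id_by_bases = {}
    seq_table = {}
    refs = []
    for name, bases in records:
        nid = id_by_bases.get(bases)
        if nid is None:
            # First time this content is seen: allocate a new identifier.
            nid = f"s{len(seq_table) + 1}"
            id_by_bases[bases] = nid
            seq_table[nid] = bases
        refs.append((name, nid))
    return seq_table, refs


table, refs = intern_sequences([("r1", "ACGT"), ("r2", "ACGT"), ("r3", "TTGA")])
```

Here r1 and r2 share the single stored entry s1, so the bases ACGT are written only once.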
Drawings
Fig. 1 shows 26 high-throughput data storage formats commonly used in the prior art.
Fig. 2 shows the general technical architecture of the present invention.
Fig. 3 shows the overall format structure of the NGSGF of the present invention.
Fig. 4 shows the format structure of the NGSGF header component of the present invention.
Fig. 5 shows the format structure of the NGSGF sequence component of the present invention.
Fig. 6 shows the format structure of the NGSGF quality score component of the present invention.
Fig. 7 shows the format structure of the NGSGF sequence information component of the present invention.
Fig. 8 shows a user interface screenshot of NGSGFEditor in embodiment 2 of the present invention.
Fig. 9 shows a screenshot of the two projects of NGSGFEditor in embodiment 2 of the present invention.
Fig. 10 shows an interface screenshot of "Step 1: Create a new NGSGF file" in example 2.
Fig. 11 shows an interface screenshot of "Step 2: Add a sequence" in example 2.
Fig. 12 shows an interface screenshot of "Step 3: Add a quality score" in example 2.
Fig. 13 shows an interface screenshot of "Step 4: Add sequence information" in example 2.
Fig. 14 shows an interface screenshot of "Step 5: Save the NGSGF file" in example 2.
Fig. 15 shows a screenshot of embodiment 3 of the present invention for converting FASTQ and NGSGF format files through the NGSGFEditor GUI.
FIG. 16 shows a screenshot of the help displayed on entering "java -jar NGSGFEditor.jar -h" in example 3 of the present invention.
Fig. 17 shows an interface screenshot of embodiment 3 of the present invention for converting SAM and NGSGF format files using NGSGFEditor command line.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more clear, the technical solutions of the present invention will be clearly and completely described below with reference to the embodiments of the present invention and the accompanying drawings.
To solve the compatibility problem of current NGS (next-generation sequencing) data, we have developed a new XML-based general storage format, hereinafter abbreviated as NGSGF, which can accommodate most NGS data types. NGSGF is based on the Extensible Markup Language (XML), which is widely used for data storage on the Internet and in fields such as mathematics and biology. NGSGF describes data produced by NGS technology and integrates the different types of information used in NGS, such as alignment, assembly and annotation information. Because XML is highly extensible, NGSGF is easily extended with new features.
The invention first surveys the data storage formats used in the field of high-throughput sequencing data. A total of 26 commonly used high-throughput data formats were collected and divided into five types: sequence and quality score formats (Sequence or quality score), alignment formats (Alignment), assembly formats (Assembly), mutation formats (Variant), and sequence annotation and visualization formats (Sequence annotation & visualization). The sequence and quality score formats may include the Fasta/CSFASTA, fastq/CSFASTQ, qseq, SCARF, QUAL, 2bit/nib and SFF formats; the alignment formats may include the SAM, BAM, bowtie and maq formats; the assembly formats may include the ACE, AFG and CAF formats; the mutation formats may include the GVF, pileup and VCF formats; and the annotation and visualization formats may include the BED, bigBED, wig, bigWig, bedGraph and GFF/GTF formats, as shown in fig. 1.
The specification of each format is then analyzed, mainly with respect to the content stored in the format and how that content is organized. Once the specification of each format is understood, the content common to the formats and the content specific to each is identified, and the general storage format is designed on that basis. As shown in fig. 3, the NGSGF format proposed by the present invention includes four components: a header component (header_lines), a sequence component (list_of_seqs), a quality score component (list_of_quals) and a sequence information component (list_of_seqinfos).
The header component stores header description information; most existing NGS file formats contain header information to describe the stored content. As shown in fig. 4, header_lines contains the child element meta_info, which has a name attribute and a value attribute and is used to store the header description information of the NGS file.
The sequence component stores sequence information, which is either a base sequence or the path of a file storing the base sequence. Storing a file path enables NGSGF to handle large sequence files. As shown in fig. 5, the list_of_seqs component contains one or more seq sub-elements, each representing a sequence. Each seq sub-element has a unique identifier that is used in the list_of_seqinfos component.
The quality score component stores the quality score of a sequence, which is either a quality score character string or the path of a file storing the quality score. Storing a file path enables NGSGF to handle large quality score files. As shown in fig. 6, the list_of_quals component contains one or more quality sub-elements, each representing the quality score of a sequence and each having a unique identifier that is used in the list_of_seqinfos component.
The sequence information component stores sequence records and features. As shown in fig. 7, the list_of_seqinfos component contains one or more seqinfo sub-elements, each of which represents one sequence record. Typically, NGSGF sequence records are stored in this component. The general storage format is capable of storing the content stored in the above 26 formats.
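As an illustration of how existing header information could map onto the header component, the Python sketch below turns "##key=value" header lines, such as those in the VCF data of Example 1, into meta_info elements with name and value attributes. The splitting rule (everything before the first "=" is the name) and the function name are assumptions; the source does not specify how each format's header is mapped.

```python
import xml.etree.ElementTree as ET


def header_lines_to_xml(lines):
    """Convert '##key=value' header lines into a header_lines element."""
    header = ET.Element("header_lines")
    for line in lines:
        if not line.startswith("##"):
            continue  # not a meta-information line
        # Split at the first '=' only, so structured values stay intact.
        name, sep, value = line[2:].strip().partition("=")
        if sep:
            ET.SubElement(header, "meta_info", name=name, value=value)
    return header


header = header_lines_to_xml([
    "##fileformat=VCFv4.2",
    "##fileDate=20090805",
    "#CHROM POS ID REF ALT",
])
pairs = [(m.get("name"), m.get("value")) for m in header]
```

The column header line is skipped because it starts with a single "#", so only the two meta-information lines become meta_info elements.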
Based on the designed structure, the present invention also provides corresponding editing and conversion software (shown as NGSGF Editor and NGSGF Format Converter in the figure), which can not only edit high-throughput data files in this format structure but also convert between existing text-based high-throughput data formats and the XML-based general format, as shown in fig. 2.
Example 1
The NGSGF format designed by the invention can store data in the FASTA, FASTQ, SAM, CAF and VCF formats commonly used for high-throughput sequencing.
1. Sequence format FASTA data is stored in NGSGF format
Data of FASTA:
>KM081703.1 Abbottina rivularis mitochondrion, complete genome
GCTAGTGTAGCTTAATCCAAAGCATAACACTGAAGATGTTAAGATGAGCCCTAAGAAGCTCCGCATGCAC
>AF511507.1 Alligator sinensis mitochondrion, complete genome
CAATAAAGACTTAGTCCCGGTCTTCTTATTAACTACCACTTAACCTATACATGCAAGCATCCACGAACCA
data of corresponding NGSGF:
2. Sequence and quality score format FASTQ data is stored in NGSGF format
Data for FASTQ:
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
data of corresponding NGSGF:
3. Sequence alignment format SAM data is stored in NGSGF format
Data for SAM:
data of corresponding NGSGF:
4. Sequence assembly format CAF data is stored in NGSGF format
CAF data:
DNA:22ak93c2.r1t
GTCGCnCATAAGATTACGAGATCTCGAGCTCGGTACCCTTCAAGCGATTCTCCTGCCTCA

BaseQuality:22ak93c2.r1t
4 4 8 4 4 4 4 4 4 4 4 4 6 8 17 21 14 7 6 6 6 7 7 6 8 14 16 21 15 20 20
24 26 21 18 18 14 14 19 23 10 8 8 15 20 16 29 26 34 29 39 29 31 29 31

Sequence:22ak93c2.r1t
Is_read
Padded
Staden_id 11
Clipping QUAL 39 331
Align_to_SCF 1 43 1 43
Align_to_SCF 44 317 45 318
Align_to_SCF 319 716 319 716
SCF_File 22ak93c2.r1tSCF
Primer Universal_primer
Strand Reverse
Dye Dye_terminator
Template 22ak93c2
Clone bK216E10
Sequencing_vector "m13mp18"
Seq_vec SVEC 1 38 "M13mp18"
Tag ALUS 43 180
Tag DONE 43 43 "AUTO-EDIT:deleted C at 43(terminator,isolated,strong)"
Tag DONE 254 254 "AUTO-EDIT:replaced T by g at 254(terminator,isolated,strong)"
Tag ALUS 269 402
Tag DONE 283 283 "AUTO-EDIT:replaced T by g at 283(terminated,isolated,strong)"
Tag AMBG 298 302 "AUTOEDIT:Check this edit cluster!"
Tag DONE 317 317 "AUTO-EDIT:replaced C by a at 317(terminated,compound,strong)"
Tag DONE 318 318 "AUTO-EDIT:inserted g at 318(terminated,compound,strong)"
data of corresponding NGSGF:
5. Sequence mutation format VCF data is stored in NGSGF format
Data of VCF:
##fileformat=VCFv4.2
##fileDate=20090805
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##FILTER=<ID=q10,Description="Quality below 10">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
data of corresponding NGSGF:
Example 2 Creation of NGSGF files
NGSGFEditor is designed for creating and editing NGSGF files. It has a user-friendly GUI and can also be run from the command line, which makes working with NGSGF files much easier. FIG. 8 shows the user interface of NGSGFEditor while starting the software (A: start NGSGFEditor), converting a format (B: convert a SAM file), opening a file (C: open an NGSGF file) and editing a file (D: edit an NGSGF file).
The NGSGFEditor in this embodiment is written in Java using NetBeans IDE 10.0 and consists of two projects, as shown in fig. 9.
Here we use NGSGFEditor to create an NGSGF file from FASTQ data. The contents of the FASTQ file are:
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
Step 1: Create a new NGSGF file
Click the "New" button to create a new NGSGF file, as shown in fig. 10.
Step 2: Add a sequence
(1) Right-click the "ngs" node to display the popup menu, then click the "list_of_seqs" menu item to create a new node, as shown in fig. 11 (1).
(2) Right-click the "list_of_seqs" node and add a "seq" child node, as shown in fig. 11 (2).
(3) Right-click the "seq" node and add the "nid" attribute, as shown in fig. 11 (3).
(4) Right-click the "nid" node and select the "Edit" menu to edit the node value, as shown in fig. 11 (4).
(5) Add an "origin" node in the same way as the "nid" node. Right-click the "origin" node, select the "Edit" menu and enter the sequence value, as shown in fig. 11 (5).
Step 3: Add a quality score
(1) Right-click the "ngs" node and add the "list_of_quals" node, as shown in fig. 12 (1).
(2) Add the "nid" and "origin" nodes. Right-click the "origin" node and enter the quality score, as shown in fig. 12 (2).
Step 4: Add sequence information
(1) Right-click the "ngs" node and add the "list_of_seqinfos" node, as shown in fig. 13 (1).
(2) Right-click the "list_of_seqinfos" node and add a "seqinfo" node, as shown in fig. 13 (2).
(3) Right-click the "seqinfo" node and add a "seq" node, as shown in fig. 13 (3).
(4) Right-click the "seq" node and add the "seqref" attribute, as shown in fig. 13 (4).
(5) Right-click the "seqinfo" node and add a "qual" node, as shown in fig. 13 (5).
(6) Add the "seqref" and "qualref" nodes to the "seq" and "qual" nodes, respectively. Right-click the "seqref" and "qualref" nodes and enter the reference values. In this example, the sequence of the first record is "s1" and its quality score is "q1", as shown in fig. 13 (6).
The second record of the FASTQ file is added in the same way as the first.
Step 5: Save the NGSGF file
Finally, the sequences are stored in the "list_of_seqs" node, the quality scores in the "list_of_quals" node, and the FASTQ records in the "list_of_seqinfos" node, as shown in fig. 14.
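The manual steps above can also be sketched programmatically. The following Python sketch reads four-line FASTQ records and builds the same three nodes, using the node and attribute names from this example (ngs, list_of_seqs, list_of_quals, list_of_seqinfos, seq, qual, nid, origin, seqref, qualref). Treating origin as an attribute and recording the read title in a name attribute on seqinfo are assumptions, and this is not the NGSGFEditor implementation.

```python
import xml.etree.ElementTree as ET

FASTQ_TEXT = """@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
"""


def fastq_to_ngsgf(fastq_text):
    """Convert simple four-line FASTQ records into an NGSGF-like tree."""
    lines = [ln.strip() for ln in fastq_text.strip().splitlines()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    root = ET.Element("ngs")
    seqs = ET.SubElement(root, "list_of_seqs")
    quals = ET.SubElement(root, "list_of_quals")
    infos = ET.SubElement(root, "list_of_seqinfos")
    for i in range(0, len(lines), 4):
        title, bases, plus, quality = lines[i:i + 4]
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        n = i // 4 + 1
        # Steps 2 and 3: store the sequence and its quality score.
        ET.SubElement(seqs, "seq", nid=f"s{n}", origin=bases)
        ET.SubElement(quals, "qual", nid=f"q{n}", origin=quality)
        # Step 4: the record only references them.
        record = ET.SubElement(infos, "seqinfo", name=title[1:])
        ET.SubElement(record, "seq", seqref=f"s{n}")
        ET.SubElement(record, "qual", qualref=f"q{n}")
    return root


ngsgf_root = fastq_to_ngsgf(FASTQ_TEXT)
```

The sketch covers only the common case of one sequence line and one quality line per record; FASTQ files that wrap sequences over several lines would need a fuller parser.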
Example 3 Conversion between NGS files and NGSGF files
The user may also use NGSGFEditor to convert between NGS files and NGSGF files.
Currently, NGSGFEditor supports five formats: FASTA, FASTQ, SAM, VCF and CAF.
NGSGFEditor can be run on Windows and Linux systems.
Format conversion can be invoked through the GUI or the command line.
1. Using the NGSGFEditor GUI
1.1 Conversion of FASTQ to NGSGF
(1) Add the FASTQ file using the "Add" button.
(2) Select the output folder using the "Browse" button.
(3) Click the "Start" button.
As shown in fig. 15 (1).
1.2 Conversion of NGSGF to FASTQ
(1) Add the NGSGF file using the "Add" button.
(2) Select NGSGF as the input format and FASTQ as the output format.
(3) Select the output folder using the "Browse" button.
(4) Click the "Start" button.
As shown in fig. 15 (2).
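The reverse direction can be sketched in the same spirit: each record's references are resolved against the sequence and quality score nodes and written back as four FASTQ lines. This Python sketch assumes the attribute layout used for illustration in Example 2 (nid, origin, seqref, qualref) plus a hypothetical name attribute holding the read title; it does not describe how NGSGFEditor performs the conversion.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-record document used only for this illustration.
NGSGF_TEXT = (
    '<ngs>'
    '<list_of_seqs><seq nid="s1" origin="ACGT"/></list_of_seqs>'
    '<list_of_quals><qual nid="q1" origin="IIII"/></list_of_quals>'
    '<list_of_seqinfos>'
    '<seqinfo name="r1"><seq seqref="s1"/><qual qualref="q1"/></seqinfo>'
    '</list_of_seqinfos>'
    '</ngs>'
)


def ngsgf_to_fastq(root):
    """Render every seqinfo record as a four-line FASTQ entry."""
    seq_by_id = {s.get("nid"): s.get("origin", "")
                 for s in root.findall("list_of_seqs/seq")}
    qual_by_id = {q.get("nid"): q.get("origin", "")
                  for q in root.findall("list_of_quals/qual")}
    entries = []
    for n, record in enumerate(root.findall("list_of_seqinfos/seqinfo"), start=1):
        bases = seq_by_id[record.find("seq").get("seqref")]
        quality = qual_by_id[record.find("qual").get("qualref")]
        # Fall back to a generated title if the record carries no name.
        name = record.get("name", f"record{n}")
        entries.append(f"@{name}\n{bases}\n+\n{quality}\n")
    return "".join(entries)


fastq_text = ngsgf_to_fastq(ET.fromstring(NGSGF_TEXT))
```

A dangling seqref or qualref raises KeyError here, which is a deliberate choice in the sketch so that broken references are not silently written out as empty reads.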
2. Using the NGSGFEditor command line
This example is run on a Linux system.
Entering "java -jar NGSGFEditor.jar -h" displays the help, as shown in fig. 16.
2.1 Conversion of SAM to NGSGF
Entering "java -jar NGSGFEditor.jar -c SAM2NGSGF -i input_path -o output_path" converts SAM to NGSGF, as shown in fig. 17 (1).
2.2 Conversion of NGSGF to SAM
Entering "java -jar NGSGFEditor.jar -c NGSGF2SAM -i input_path -o output_path" converts NGSGF to SAM, as shown in fig. 17 (2).
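For illustration only, a command line of this shape could be parsed as in the Python sketch below. It assumes the options are -c for the conversion name (for example SAM2NGSGF or NGSGF2SAM), -i for the input path and -o for the output path; the program name ngsgf-convert and the function name are hypothetical, and the sketch does not reproduce NGSGFEditor's actual option handling.

```python
import argparse

# The five formats the text says are supported for conversion.
SUPPORTED = {"FASTA", "FASTQ", "SAM", "VCF", "CAF"}


def parse_conversion(argv):
    """Parse '-c SRC2DST -i input -o output' into its parts."""
    parser = argparse.ArgumentParser(prog="ngsgf-convert")
    parser.add_argument("-c", dest="conversion", required=True,
                        help="conversion name, e.g. SAM2NGSGF")
    parser.add_argument("-i", dest="input_path", required=True)
    parser.add_argument("-o", dest="output_path", required=True)
    args = parser.parse_args(argv)

    # Split 'SAM2NGSGF' at the first '2' into source and target.
    source, sep, target = args.conversion.upper().partition("2")
    if not sep or "NGSGF" not in (source, target):
        parser.error("conversion must be <FORMAT>2NGSGF or NGSGF2<FORMAT>")
    other = target if source == "NGSGF" else source
    if other not in SUPPORTED:
        parser.error(f"unsupported format: {other}")
    return source, target, args.input_path, args.output_path


parsed = parse_conversion(["-c", "SAM2NGSGF", "-i", "in.sam", "-o", "out.xml"])
```

Splitting at the first "2" works for the five listed formats because none of their names contains that digit.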
It should be understood that all other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without inventive effort fall within the scope of protection of the present invention.
Claims (3)
1. A method for constructing a general storage format structure for high-throughput sequencing data, the structure being designed based on XML and XML Schema technology, the method comprising the following steps:
1) Existing high-throughput data formats are collected and classified into five types: sequence and quality score formats, alignment formats, assembly formats, mutation formats, annotation and visualization formats;
2) The specification of each format is analyzed to identify the content that the formats have in common and the content that is specific to each;
3) The general storage format is designed based on this common and format-specific content, and the high-throughput sequencing data general storage format structure comprises four components: a header component, a sequence component, a quality score component and a sequence information component, wherein:
the header component is used for storing header description information of a file, and comprises a sub-element meta_info, wherein the meta_info comprises a name attribute and a value attribute;
the sequence component is used for storing sequence information, the sequence information being a base sequence or a file path for storing the base sequence, and comprises one or more seq sub-elements for representing sequences, each seq sub-element having a unique identifier by which it is located from the sequence information component;
the quality score component is used for storing a sequence quality score, the sequence quality score being a quality score character string or a file path for storing the quality score, and comprises one or more quality sub-elements for representing sequence quality scores, each quality sub-element having a unique identifier by which it is located from the sequence information component;
the sequence information component is used for storing records and features of a sequence, and comprises one or more seqinfo sub-elements, wherein one seqinfo sub-element represents a sequence record.
2. The method for constructing a high-throughput sequencing data general storage format structure according to claim 1, wherein the sequence and quality score formats comprise the Fasta/CSFASTA, fastq/CSFASTQ, qseq, SCARF, QUAL, 2bit/nib and SFF formats; the alignment formats comprise the SAM, BAM, bowtie and maq formats; the assembly formats comprise the ACE, AFG and CAF formats; the mutation formats comprise the GVF, pileup and VCF formats; the annotation and visualization formats comprise the BED, bigBED, wig, bigWig, bedGraph and GFF/GTF formats.
3. Use of the high-throughput sequencing data generic storage format structure obtained by the construction method of claim 1 or 2 for representing, storing, editing and converting high-throughput sequencing data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010748559.8A CN111881324B (en) | 2020-07-30 | 2020-07-30 | High-throughput sequencing data general storage format structure, construction method and application thereof |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010748559.8A CN111881324B (en) | 2020-07-30 | 2020-07-30 | High-throughput sequencing data general storage format structure, construction method and application thereof |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111881324A CN111881324A (en) | 2020-11-03 |
CN111881324B true CN111881324B (en) | 2023-12-15 |
Family
ID=73204229
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010748559.8A Active CN111881324B (en) | 2020-07-30 | 2020-07-30 | High-throughput sequencing data general storage format structure, construction method and application thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111881324B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103546160A (en) * | 2013-09-22 | 2014-01-29 | 上海交通大学 | Multi-reference-sequence based gene sequence stage compression method |
CN104169927A (en) * | 2012-02-28 | 2014-11-26 | 皇家飞利浦有限公司 | Compact next generation sequencing database and efficient sequence processing using same |
WO2015180203A1 (en) * | 2014-05-30 | 2015-12-03 | 周家锐 | High-throughput dna sequencing quality score lossless compression system and compression method |
WO2016105579A1 (en) * | 2014-12-22 | 2016-06-30 | Board Of Regents Of The University Of Texas System | Systems and methods for processing sequence data for variant detection and analysis |
CN105760706A (en) * | 2014-12-15 | 2016-07-13 | 深圳华大基因研究院 | Compression method for next generation sequencing data |
CN106446600A (en) * | 2016-05-20 | 2017-02-22 | 同济大学 | CRISPR/Cas9-based sgRNA design method |
WO2017214765A1 (en) * | 2016-06-12 | 2017-12-21 | 深圳大学 | Multi-thread fast storage lossless compression method and system for fastq data |
CN107609350A (en) * | 2017-09-08 | 2018-01-19 | 厦门极元科技有限公司 | A kind of data processing method of two generations sequencing data analysis platform |
WO2019150287A1 (en) * | 2018-01-30 | 2019-08-08 | Encapsa Technology Llc | Method and system for encapsulating and storing information from multiple disparate data sources |
CN110517726A (en) * | 2019-07-15 | 2019-11-29 | 西安电子科技大学 | A kind of microbe composition and concentration detection method based on high-flux sequence data |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3005200A2 (en) * | 2013-06-03 | 2016-04-13 | Good Start Genetics, Inc. | Methods and systems for storing sequence read data |
Non-Patent Citations (2)
Title |
---|
NGS-FC: A Next-Generation Sequencing Data Format Converter; Chunjiang Yu et al.; IEEE; pp. 1683-1691 * |
XML for Data Representation and Model Specification in Neuroscience; Sharon M. Crook et al.; Neuroinformatics; pp. 53–66 * |
Also Published As
Publication number | Publication date |
---|---|
CN111881324A (en) | 2020-11-03 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US11651149B1 (en) | Event selection via graphical user interface control | |
US11423216B2 (en) | Providing extraction results for a particular field | |
US10783318B2 (en) | Facilitating modification of an extracted field | |
US8972372B2 (en) | Searching code by specifying its behavior | |
US9026901B2 (en) | Viewing annotations across multiple applications | |
CN108469952B (en) | Code generation method and matched tool for managing game configuration | |
CN102135938B (en) | Software product testing method and system | |
JP2000181917A (en) | Structured document managing method, executing device therefor and medium recording processing program therefor | |
US20060101392A1 (en) | Strongly-typed UI automation model generator | |
CN112667735A (en) | Visualization model establishing and analyzing system and method based on big data | |
CN108804300A (en) | Automated testing method and system | |
Borowski et al. | Graph Buddy—an interactive code dependency browsing and visualization tool | |
Kienle et al. | Evolution of web systems | |
CN111881324B (en) | High-throughput sequencing data general storage format structure, construction method and application thereof | |
CN116107524B (en) | Low-code application log processing method, medium, device and computing equipment | |
JP2009211599A (en) | Mapping definition creation system and mapping definition creation program | |
Leonard et al. | SQL Server 2012 integration services design patterns | |
CN112395818A (en) | Hardware algorithm model construction method based on SysML | |
Belaid et al. | An Ontology and Indexation based Management of Services and Workflows Application to Geological Modeling. | |
CN111124548B (en) | Rule analysis method and system based on YAML file | |
Hyams et al. | “You could use the API!”: A Crash Course in Working with the Alma APIs using Postman | |
Wang et al. | An aspect-oriented UML tool for software development with early aspects | |
Verbeek et al. | Visualizing state spaces with Petri nets | |
CN115248803B (en) | Collection method and device suitable for network disk file, network disk and storage medium | |
Jordão et al. | TypeTaxonScript: sugarifying and enhancing data structures in biological systematics and biodiversity research | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||