CN113901006B - Large-scale gene sequencing data storage and query system

Info

Publication number: CN113901006B
Application number: CN202111191995.0A
Authority: CN
Inventors: 邢潇; 卓子寒; 何跃鹰; 国宏哲; 董洪超; 谷杰铭; 张翀; 吕欣润; 张程鹏; 张奕欣
Original assignee: National Computer Network and Information Security Management Center
Current assignee: National Computer Network and Information Security Management Center
Priority date: 2021-10-13
Filing date: 2021-10-13
Publication date: 2024-05-24
Anticipated expiration: 2041-10-13
Also published as: CN113901006A

Abstract

A large-scale gene sequencing data storage and query system, belonging to the technical field of gene sequencing data storage and query. The invention addresses two problems of existing methods: the amount of gene sequencing data that can be stored is limited by the available storage space, and data storage efficiency is low. The invention stores data through a file storage module and a database, and retrieves the data recorded in the database through a data retrieval module. Compressing the data improves storage efficiency and significantly increases the amount of data that can be stored in a limited space. The system can hold at most 6.1 PB of data, and the database can store at most 12 million metadata records. The invention can be applied to the storage and querying of gene sequencing data.

Description

Large-scale gene sequencing data storage and query system

Technical Field

The invention relates to the technical field of storage and query of gene sequencing data, in particular to a large-scale gene sequencing data storage and query system.

Background

Conventional data storage systems mainly use one of three architectures: DAS, NAS and SAN. As data volumes have grown, however, conventional storage systems can no longer meet storage demand, and distributed storage systems have emerged in response. In recent years, cloud storage systems built on distributed architectures have developed rapidly because they suit many new application scenarios.

Large-scale genome sequencing projects typically produce PB-scale sequencing data, which must be recorded in some format on a storage medium internal or external to a computer. Fast storage and retrieval of these gene sequencing data is therefore a very important issue, as are the security of the stored data and the ability to query it. However, existing data storage methods neither store gene sequencing data efficiently enough nor allow more gene sequencing data to be stored in a limited space, so the amount of gene sequencing data that can be stored is severely limited by the available storage space.

Disclosure of Invention

The invention aims to solve the problems that, in existing methods, the amount of gene sequencing data that can be stored is limited by the available storage space and data storage efficiency is low, and provides a system capable of efficiently storing large-scale gene sequencing data and rapidly querying it.

The technical scheme adopted by the invention for solving the technical problems is as follows:

A large-scale gene sequencing data storage and query system comprises a user login module, a data compression module, a file storage module, an interface display module, a database, a data uploading module, a data retrieval module and a data approval module, wherein:

The user login module is used by a user to enter real-name authentication information, which is matched against the user information stored in the database; if the match succeeds, the user is logged in; otherwise, the real-name authentication information must be entered again for login verification;

The data uploading module is used by a logged-in user to upload gene sequencing data, and places the uploaded gene sequencing data in a cache area;

Through the interface display module, the user confirms whether the gene sequencing data in the cache area are correct; if so, the user sends a confirmation instruction to the cache area and the gene sequencing data in the cache area are stored into the file storage module; if not, the gene sequencing data in the cache area are deleted;

The file storage module sends the stored gene sequencing data to the data compression module;

The data compression module is used for carrying out data compression on the gene sequencing data, uploading the sequencing data file after data compression to the hierarchical directory of the file storage module, deleting the original gene sequencing data corresponding to the sequencing data file in the file storage module, and updating the database;

The data approval module is used by an administrator to log in to the system and audit the sequencing data files uploaded by users and recorded in the database; if a file passes the audit, its audit state in the database is updated and the file is stored permanently; if it does not pass, the sequencing data file is deleted from the database;

The data retrieval module is used for retrieving the sequencing data file in the database by a user.

Further, the user information is stored in the database in MD5 encrypted format.

Further, the means for uploading the gene sequencing data by the user comprises single sample uploading and batch sample uploading.

Further, the system also comprises a template uploading module, when the batch samples are uploaded, a user fills single sample information in the batch samples into the template, and the filled template is uploaded to the data uploading module through the template uploading module.

Further, the user fills single sample information in the batch of samples into a template, and the format of the template is an Excel format.

Further, the format of the data-compressed sequencing data file is BAM or VCF.

Further, the data-compressed sequencing data files cover 7 types of omics data; the 7 omics types are genomics, transcriptomics, proteomics, metabolomics, microbiomics, phenomics and exposomics data, respectively.

Further, the hierarchical relationship of the hierarchical directory in the file storage module is user name, project name, experiment name, sample name and file name.

Further, the specific process of the data compression module for carrying out data compression on the gene sequencing data is as follows:

step 1, aligning the gene sequencing data to be compressed against a reference genome, and marking all variant sites;

step 2, establishing two one-dimensional arrays which are respectively used for recording the starting position and the ending position of a sequence segment where the currently processed gene sequencing data are located;

step 3, constructing a variation position matrix from the marked variant sites and their positions within the sequence segments, wherein the variation position matrix has dimension M × N, M is the number of actual samples contained in the gene sequencing data to be compressed, and N is the length of the reference genome;

each element of the variation position matrix is 0 or 1, where 0 indicates that the base of the gene sequencing data at that site is the same as the corresponding base of the reference genome, and 1 indicates that it is different;

starting from the first column of the variation position matrix, the rows are sorted at each column in reverse-prefix lexicographic order, and the columns are traversed from left to right until every column has been processed, yielding a new variation position matrix;

step 4, compressing the new variation position matrix using run-length encoding.
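Steps 3 and 4 can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes the 0/1 variation position matrix already exists as a list of rows, uses the standard PBWT stable-partition update to keep rows in reverse-prefix order, and run-length encodes each column. The function names pbwt_transform, run_length_encode and compress are hypothetical.

```python
from itertools import groupby

def pbwt_transform(matrix):
    """Return a matrix whose column k holds the original column k, listed in
    reverse-prefix order (rows sorted by columns k-1, k-2, ..., 0)."""
    m = len(matrix)
    n = len(matrix[0]) if m else 0
    order = list(range(m))              # at column 0 every prefix is empty
    out = [[0] * n for _ in range(m)]
    for k in range(n):
        zeros, ones = [], []
        for rank, row in enumerate(order):
            bit = matrix[row][k]
            out[rank][k] = bit
            (ones if bit else zeros).append(row)
        # A stable partition on the current bit yields the reverse-prefix
        # order needed for the next column.
        order = zeros + ones
    return out

def run_length_encode(bits):
    """Collapse a sequence into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(bits)]

def compress(matrix):
    """Transform the matrix, then run-length encode it column by column."""
    transformed = pbwt_transform(matrix)
    return [run_length_encode(column) for column in zip(*transformed)]

# Illustrative input: four samples (rows) over three sites (columns).
example = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
runs = compress(example)
```

In this example the second and third columns each collapse to two runs after the transform, because rows that share the same preceding bits become adjacent.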

The beneficial effects of the invention are as follows:

The invention provides a large-scale gene sequencing data storage and query system with automatic classification of gene sequencing data. Data are stored through the file storage module and the database, and the data recorded in the database are retrieved through the data retrieval module. Compressing the data improves storage efficiency and significantly increases the amount of data that can be stored in a limited space.

The system can hold at most 6.1 PB of data, and the database can store at most 12 million metadata records. Up to 51 types of labels can be searched simultaneously, 500,000 records can be displayed at the front end through paging, and a multi-condition search takes no more than 10 s.

Drawings

FIG. 1 is a flow chart of a large-scale gene sequencing data storage and query system of the present invention;

FIG. 2 is a data interaction flow diagram;

FIG. 3 is a diagram of a data integration engine;

FIG. 4 is a user login flow chart;

FIG. 5 is a flow chart of multi-omics data upload;

FIG. 6 is a flow chart of data compression;

FIG. 7 is a data retrieval flow chart.

Detailed Description

First embodiment: this embodiment is described with reference to FIGS. 1, 4 and 5. The large-scale gene sequencing data storage and query system of this embodiment comprises a user login module, a data compression module, a file storage module, an interface display module, a database, a data uploading module, a data retrieval module and a data approval module, wherein:

The user login module is used by a user to enter real-name authentication information, which is matched against the user information stored in the database; if the match succeeds, the user is logged in; otherwise, the real-name authentication information must be entered again for login verification. Registration and login are completed before the system is used to upload or retrieve data;

To ensure that the file data uploaded by users are genuine and valid, an administrator must audit the uploaded data. The administrator searches the file list, filters it down to the data records that have not yet been audited, locates each file through its file path, and checks whether the file format and content are correct. For data that pass the audit, a data update function is provided and the audit state of the data in the database is updated. For data that fail the audit, the state is updated and the corresponding file is removed from the file system;

As shown in FIG. 7, the data retrieval module provides fuzzy retrieval and multi-information joint retrieval. For fuzzy retrieval, the user enters only a keyword or part of a keyword, which is submitted to the background interface of the data retrieval module; a database keyword index speeds up the search, and the resulting data list is returned to the user. For multi-information joint retrieval, the system groups labels into categories for sample information, experiment information, file type, multi-omics type and sequencing type. Each category offers several classification labels, displayed to the user as check boxes, and the user can select multiple labels under different categories. The system packages the selected retrieval conditions together with the text entered in the search box as JSON-format data, submits it to the retrieval interface, and returns the results to the user as a list. The data retrieval module also provides a list export function: the user selects the column labels to be downloaded and exports an Excel file.
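The packaging of selected labels and search-box text into JSON can be sketched as follows. This is only an illustration: the field names keyword and filters, the category keys, and the function build_retrieval_request are assumptions, since the description does not specify the request schema.

```python
import json

def build_retrieval_request(keyword, selected_labels):
    """Package the search-box text and the ticked labels as one JSON string.

    selected_labels maps a category name to the labels ticked under it;
    categories with nothing ticked are left out of the request.
    """
    filters = {category: sorted(labels)
               for category, labels in selected_labels.items() if labels}
    payload = {"keyword": keyword.strip(), "filters": filters}
    return json.dumps(payload, ensure_ascii=False, sort_keys=True)

# Hypothetical selection made through the check boxes.
request_body = build_retrieval_request(
    " BRCA ",
    {"file_type": {"VCF", "BAM"}, "sequencing_type": [], "omics_type": ["genomics"]},
)
```

Sorting the labels and keys keeps the request deterministic, which makes identical searches easy to compare or cache.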

Before data can be uploaded or retrieved, the user must register and log in. A user who uploads data must provide real-name authentication information, which is matched against the user information stored in MD5-encrypted format; this improves security.
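Because MD5 is a one-way digest, the matching described above would in practice compare the digest of the submitted information with the stored digest. The sketch below assumes the database can be modelled as a dictionary from user name to stored MD5 hex digest; the names md5_hex, login and user_table are hypothetical and not taken from the patent.

```python
import hashlib
import hmac

def md5_hex(text):
    """Return the MD5 digest of a string as 32 hexadecimal characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def login(user_table, username, credential):
    """Return True when the digest of the submitted credential matches the stored one."""
    stored = user_table.get(username)
    if stored is None:
        return False                      # unknown user: login fails
    # compare_digest avoids leaking the mismatch position through timing
    return hmac.compare_digest(stored, md5_hex(credential))

# Hypothetical stored record: only the digest is kept, never the raw value.
user_table = {"alice": md5_hex("real-name-id-0001")}
```

A failed match simply returns False, after which the user would be asked to enter the authentication information again.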

The biological information data storage method and the data visualization program of the invention can write the biological information data into a database in a classified manner, and can carry out data association retrieval and visualization.

Second embodiment: this embodiment differs from the first embodiment in that the user information is stored in the database in MD5-encrypted format, so as to improve security.

Other steps and parameters are the same as in the first embodiment.

Third embodiment: this embodiment differs from the first or second embodiment in that the ways in which the user uploads gene sequencing data include single-sample upload and batch-sample upload.

When a single sample is uploaded, the uploader provides their basic information and fills in the data meta information, that is, auxiliary information about the uploaded sample, which is used to associate the sample with its sequencing data file.

Other steps and parameters are the same as in the first or second embodiment.

Fourth embodiment: this embodiment differs from one of the first to third embodiments in that the system further includes a template uploading module; when a batch of samples is uploaded, the user fills the information for each single sample in the batch into the template, and the completed template is uploaded to the data uploading module through the template uploading module.

For batch upload, the system provides an upload template in Excel format. The number of samples and the number of experiments in the template can be generated automatically from parameters entered by the user. The template holds the information for multiple single samples; after filling it in, the user uploads it through the template upload functional unit. The sequencing data files are then uploaded to a designated location on the server; once the upload succeeds, the front-end page refreshes and displays the list of uploaded file paths, which is submitted to the server interface after the user has checked it. The system parses the Excel file and writes its contents to the database, and echoes the data back so that the user can check whether the uploaded Excel file contains errors.

Other steps and parameters are the same as in one of the first to third embodiments.

Fifth embodiment: this embodiment differs from one of the first to fourth embodiments in that the template into which the user fills the single-sample information of the batch is in Excel format.

Other steps and parameters are the same as in one of the first to fourth embodiments.

Sixth embodiment: this embodiment differs from one of the first to fifth embodiments in that the format of the data-compressed sequencing data file is BAM or VCF.

Other steps and parameters are the same as in one of the first to fifth embodiments.

Seventh embodiment: this embodiment differs from one of the first to sixth embodiments in that the data-compressed sequencing data files cover 7 types of omics data; the 7 omics types are genomics, transcriptomics, proteomics, metabolomics, microbiomics, phenomics and exposomics data, respectively.

Other steps and parameters are the same as in one of the first to sixth embodiments.

Eighth embodiment: the difference between the present embodiment and one of the first to seventh embodiments is that the hierarchical relationship of the hierarchical directory in the file storage module is user name, project name, experiment name, sample name, and file name.

The temporary file directory that is created is associated with the ID of each registered user. When a user uploads files, the meta information corresponding to the files is uploaded as well; it includes detailed information on the data samples, basic information on the sample users, sequencing information and the like. The user name is the name under which the user is currently logged in to the system; the project name is the experimental objective, and one project may contain multiple experiments; each experiment in turn contains multiple pieces of sample information, sequencing platform information, data processing information and so on; each sample may contain one or more result files.
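The five-level hierarchy (user name, project name, experiment name, sample name, file name) maps naturally onto a directory path. The sketch below is illustrative only: the storage root, the validation rule and the function name storage_path are assumptions, not details given in the description.

```python
from pathlib import PurePosixPath

def storage_path(root, user, project, experiment, sample, filename):
    """Build root/user/project/experiment/sample/filename.

    Each component is checked so that a name cannot escape its level
    (the check itself is an assumption, not part of the patent).
    """
    parts = [user, project, experiment, sample, filename]
    for part in parts:
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"invalid path component: {part!r}")
    return PurePosixPath(root, *parts)

# Hypothetical names for one compressed result file.
path = storage_path("/data", "alice", "projectA", "exp1", "S001", "S001.bam")
```

Since one sample may hold several result files, calling the function with the same first four components and different file names places them side by side in the same sample directory.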

Other steps and parameters are the same as those of one of the first to seventh embodiments.

Ninth embodiment: this embodiment is described with reference to FIG. 6. It differs from one of the first to eighth embodiments in the specific process by which the data compression module compresses the gene sequencing data:

Step 2, establishing two one-dimensional arrays which are respectively used for recording the starting position and the ending position of a sequence segment where the currently processed gene sequencing data are located; the recorded starting position and the recorded ending position are mainly used for subsequent restoration and decompression calculation of the compressed data;

In order to improve the storage efficiency of the file system, that is, to store more data in a limited space, the system adopts a resequencing compression algorithm (SRVZip) based on a self-indexing structure, combined with classical byte coding techniques, to achieve lossless data compression with a higher compression ratio, thereby relieving the pressure of data storage and data transmission to a certain extent. The compression method (SRVZip algorithm) of this embodiment is further explained as follows:

Using the human reference genome, compressed storage of sample sequencing data is achieved by recording and compressing the sequence differences between the sample and the reference genome. All sequenced fragments (reads) are aligned to the reference genome using genome sequence alignment software such as deBGA, BWA or Bowtie2; the CIGAR information in all alignment results is traversed and all variant sites are marked.
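Traversing CIGAR information to mark variant sites can be sketched as below. The sketch makes several assumptions that the description does not state: the aligner emits extended CIGAR strings in which "=" is a match and "X" a mismatch (a plain "M" does not distinguish the two, so it is not marked here), positions are reference coordinates, and an insertion is recorded at the reference position that follows it. The function name variant_sites is hypothetical.

```python
import re

# One CIGAR element is a length followed by an operation code.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")

def variant_sites(ref_start, cigar):
    """Return sorted reference positions where the read differs from the reference."""
    sites = set()
    ref_pos = ref_start
    for length, op in CIGAR_ELEMENT.findall(cigar):
        length = int(length)
        if op in ("X", "D"):
            # mismatched or deleted bases: every covered reference position differs
            sites.update(range(ref_pos, ref_pos + length))
            ref_pos += length
        elif op == "I":
            # insertion consumes no reference bases; mark its anchor position
            sites.add(ref_pos)
        elif op in ("=", "M", "N"):
            # consumes reference bases without marking a difference
            ref_pos += length
        # S, H and P consume no reference bases and mark nothing
    return sorted(sites)

# Hypothetical read aligned at reference position 100.
marked = variant_sites(100, "3=1X2=2D4=1I3=")
```

With output from aligners that report only "M", the mismatch positions would instead have to be recovered from another source such as the MD tag, which this sketch does not attempt.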

The start and end positions of the sequence segments (reads) are recorded in two one-dimensional arrays, so that sequence segments can be dynamically added to and removed from the encoding queue during PBWT (positional Burrows-Wheeler transform) encoding. The two one-dimensional arrays, start[M] and end[M], are used to determine which sequence segments (reads) must be dynamically added to the processing queue at the site currently being processed.
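The role of the two arrays can be shown with a small sweep over the sites. This is a sketch under assumptions: start and end are plain Python lists with inclusive 0-based coordinates, and the generator name sweep_sites is hypothetical; how the real encoder interleaves this with the PBWT update is not specified in the description.

```python
def sweep_sites(start, end, n_sites):
    """Yield (site, active_reads) while reads join and leave the processing queue.

    start[i] and end[i] are the inclusive first and last site covered by read i.
    """
    by_start = sorted(range(len(start)), key=lambda i: start[i])
    active = []
    nxt = 0
    for site in range(n_sites):
        # reads whose segment begins at or before this site join the queue
        while nxt < len(by_start) and start[by_start[nxt]] <= site:
            active.append(by_start[nxt])
            nxt += 1
        # reads whose segment ended before this site leave the queue
        active = [i for i in active if end[i] >= site]
        yield site, list(active)

# Three hypothetical reads covering sites 0-2, 2-4 and 3-5.
coverage = dict(sweep_sites([0, 2, 3], [2, 4, 5], 6))
```

Only the reads listed for a site would contribute a row to the matrix column being processed at that site.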

In step 3, an original sequence-alignment variation position matrix is constructed with M samples (the actual number of samples in the genome project) and N sites (the reference genome length). For differential encoding against the reference genome, 0 indicates that the base at a site is the same as the corresponding base of the reference genome, and 1 indicates that it is different. The PBWT matrix transformation is then performed: at each position, the rows are sorted in reverse-prefix lexicographic order of the current position, and the matrix is traversed from left to right until the last position is reached.

After the PBWT matrix transformation is complete, runs of consecutive 1s or 0s appear in the vertical direction, so run-length encoding can be used for compression. Because the alphabet of genome data is small, the specific variant sequence information is compressed efficiently using classical Huffman coding.
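Huffman coding of the variant bases can be illustrated with the textbook construction below. It is a generic sketch, not the encoder used by the system: the helper names huffman_codes and encode and the sample base string are assumptions, and codes are kept as strings of "0" and "1" for readability rather than packed into bytes.

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build a prefix code in which frequent symbols get shorter bit strings."""
    counts = Counter(symbols)
    if not counts:
        return {}
    if len(counts) == 1:                      # a single distinct symbol still needs one bit
        return {next(iter(counts)): "0"}
    # heap entries: (weight, tie_breaker, {symbol: code_so_far})
    heap = [(weight, i, {symbol: ""})
            for i, (symbol, weight) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)     # two lightest subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

def encode(symbols, codes):
    """Concatenate the code of every symbol."""
    return "".join(codes[s] for s in symbols)

# Hypothetical variant bases with a skewed distribution.
bases = "AAAAAAACCCGT"
codes = huffman_codes(bases)
bits = encode(bases, codes)
```

For this input the most frequent base receives a 1-bit code, so the 12 bases fit in 19 bits instead of the 24 needed at a fixed 2 bits per base.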

For the quality-score information in the sequencing data, the scores are first sparsified and homogenized so that differing quality values are converted to identical ones, forming runs of identical consecutive values that can be compressed with run-length encoding. The differenced, simplified quality values are likewise compressed using run-length encoding.

In this embodiment, lossless compression with a higher compression ratio is applied to the gene sequencing data, which relieves the pressure of data storage and data transmission to a certain extent, improves the storage efficiency of the file system, and allows more data to be stored in a limited space.

Other steps and parameters are the same as in one of the first to eighth embodiments.

As shown in FIGS. 2 and 3, the storage module is a file system on the server. When multi-omics data are uploaded, they are first placed in the cache area; after the user has checked that the data are correct, a confirmation instruction is sent to the server's cache area through the background interface, the data in the cache area are stored in the permanent storage area, and the database is updated at the same time. The file system can hold at most 6.1 PB of data. Users access the application interface through a PC browser to query, upload and store data.
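The cache-then-confirm flow can be sketched with ordinary file operations. The sketch assumes that the cache area and the permanent storage area are directories on the same server and leaves out the database update; the function name resolve_upload and its parameters are hypothetical.

```python
import shutil
from pathlib import Path

def resolve_upload(cache_file, permanent_dir, confirmed):
    """Move a cached upload into permanent storage, or discard it.

    Returns the new path when the user confirmed the data, otherwise None.
    """
    cache_file = Path(cache_file)
    if not confirmed:
        cache_file.unlink()                   # rejected data are deleted from the cache
        return None
    permanent_dir = Path(permanent_dir)
    permanent_dir.mkdir(parents=True, exist_ok=True)
    target = permanent_dir / cache_file.name
    shutil.move(str(cache_file), str(target))
    return target
```

In a full system the call would be followed by the database update mentioned above, ideally in a way that keeps the file system and the database consistent if either step fails.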

When using the system, the user first registers and logs in. As shown in FIG. 4, login is divided into administrator login and ordinary-user login, and the different roles have different permissions. An ordinary user can upload multi-omics data files and metadata to the system and can also retrieve data. An administrator audits the uploaded data and decides whether they pass the audit and become available for users to retrieve.

After logging in successfully, the user can upload multi-omics files and metadata to the system. As shown in FIG. 5, the upload flow covers multi-sample multi-omics data upload and single-sample multi-omics data upload: for a single sample the user enters the information directly, while for a multi-sample upload the user fills in parameters, downloads a template and enters the data information in the template. Once the information has been filled in, the files are uploaded through the file transmission system to the current user's directory in the file system; after the upload succeeds, the user checks whether the files are correct, and the upload is complete once they are confirmed. For successfully uploaded files, to further improve the space utilization of the file system, the system applies the resequencing compression algorithm (SRVZip) based on a self-indexing structure to compress the gene sequencing data further, rewrites the compressed files into the hierarchical directory, and deletes the original files, thereby improving the storage efficiency of the file system, as shown in FIG. 6.

For retrieval, the user can choose single-condition retrieval or multi-condition joint retrieval, as shown in FIG. 7. For single-condition retrieval, the user simply enters a keyword in the input box; after submission, the keyword is matched against the database through the interface, and a result list is returned when matches are found. For multi-condition joint retrieval, the user can select multiple labels under several different categories; the selected conditions are combined into retrieval fields and submitted to the interface, which performs a joint query across multiple tables in the database and returns a result list.
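A multi-condition joint query across several tables can be sketched with SQLite. Everything about the schema here is an assumption made for illustration (three tables named experiments, samples and files, and their columns); the description does not disclose the actual database engine or table layout.

```python
import sqlite3

# Hypothetical three-table layout: experiment -> sample -> file.
SCHEMA = """
CREATE TABLE experiments (id INTEGER PRIMARY KEY, seq_type TEXT);
CREATE TABLE samples (id INTEGER PRIMARY KEY, experiment_id INTEGER, omics_type TEXT);
CREATE TABLE files (id INTEGER PRIMARY KEY, sample_id INTEGER, file_type TEXT, path TEXT);
"""

def search(conn, omics_types=(), seq_types=(), file_types=(), keyword=""):
    """Join the tables and AND together whichever conditions were selected."""
    sql = ("SELECT f.path FROM files f "
           "JOIN samples s ON s.id = f.sample_id "
           "JOIN experiments e ON e.id = s.experiment_id")
    clauses, params = [], []
    for column, values in (("s.omics_type", omics_types),
                           ("e.seq_type", seq_types),
                           ("f.file_type", file_types)):
        if values:
            # labels ticked within one category are alternatives (IN list)
            marks = ", ".join("?" for _ in values)
            clauses.append(f"{column} IN ({marks})")
            params.extend(values)
    if keyword:
        clauses.append("f.path LIKE ?")       # simple fuzzy match on the path
        params.append(f"%{keyword}%")
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY f.path"
    return [row[0] for row in conn.execute(sql, params)]
```

User-supplied values are passed only as bound parameters, never interpolated into the SQL text, so the label selections cannot alter the query structure.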

The invention provides a system for the automatic classification, storage and retrieval of human genetic data, and establishes a comprehensive large-scale genome data management system with automatic classification, storage, retrieval and visualization functions for human genetic data. In addition, a multi-omics genetic data quality evaluation method and a sensitive genetic data identification and evaluation system are established and linked with a biological data network transmission monitoring system, so that behaviors such as illegal export, malicious forgery and expansion of genetic data can be effectively screened.

The above examples only describe the calculation model and calculation flow of the invention in detail and do not limit its embodiments. Other variations and modifications based on the above description will be apparent to those of ordinary skill in the art. It is not possible to list all embodiments exhaustively here, and all obvious variations and modifications derived from the technical solution of the invention remain within its scope of protection.

Claims

1. The large-scale gene sequencing data storage and query system is characterized by comprising a user login module, a data compression module, a file storage module, an interface display module, a database, a data uploading module, a data retrieval module and a data approval module, wherein:

The specific process of the data compression module for carrying out data compression on the gene sequencing data is as follows:

step 4, compressing the obtained new variation position matrix by adopting run length coding;

2. The large scale gene sequencing data storage and query system of claim 1, wherein said user information is stored in a database in MD5 encrypted format.

3. The large scale gene sequencing data storage and query system of claim 2, wherein said means for uploading gene sequencing data by a user comprises single sample uploading and bulk sample uploading.

4. The large-scale genetic sequencing data storage and query system of claim 3, further comprising a template upload module, wherein when uploading the batch of samples, the user fills single sample information in the batch of samples into the template, and the filled template is uploaded to the data upload module by the template upload module.

5. The large scale gene sequencing data storage and query system of claim 4, wherein said user fills single sample information in a batch of samples into templates, the templates being in Excel format.

6. The large scale gene sequencing data storage and query system of claim 5, wherein said data compressed sequencing data file is in the format of BAM or VCF.

7. The large scale gene sequencing data storage and query system of claim 6, wherein said data-compressed sequencing data files cover 7 types of omics data; the 7 omics types are genomics, transcriptomics, proteomics, metabolomics, microbiomics, phenomics and exposomics data, respectively.

8. The large-scale genetic sequencing data storage and query system of claim 7, wherein the hierarchical relationship of hierarchical directories in the file storage module is user name→project name→experiment name→sample name→file name.