CN113901006B - Large-scale gene sequencing data storage and query system - Google Patents

Large-scale gene sequencing data storage and query system

Info

Publication number
CN113901006B
CN113901006B CN202111191995.0A CN202111191995A CN113901006B CN 113901006 B CN113901006 B CN 113901006B CN 202111191995 A CN202111191995 A CN 202111191995A CN 113901006 B CN113901006 B CN 113901006B
Authority
CN
China
Prior art keywords
data
sequencing data
module
gene sequencing
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111191995.0A
Other languages
Chinese (zh)
Other versions
CN113901006A (en
Inventor
邢潇
卓子寒
何跃鹰
国宏哲
董洪超
谷杰铭
张翀
吕欣润
张程鹏
张奕欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Computer Network and Information Security Management Center
Original Assignee
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Computer Network and Information Security Management Center
Priority to CN202111191995.0A
Publication of CN113901006A
Application granted
Publication of CN113901006B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/17: Details of further file system functions
    • G06F16/172: Caching, prefetching or hoarding of files
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/14: Details of searching files based on file metadata
    • G06F16/148: File search processing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/17: Details of further file system functions
    • G06F16/174: Redundancy elimination performed by the file system
    • G06F16/1744: Redundancy elimination performed by the file system using compression, e.g. sparse files
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60: Protecting data
    • G06F21/602: Providing cryptographic facilities or services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A large-scale gene sequencing data storage and query system, belonging to the technical field of gene sequencing data storage and query. The invention addresses two problems of existing methods: the amount of gene sequencing data that can be stored is limited by the available storage space, and data storage efficiency is low. The invention stores data through the file storage module and the database, and retrieves the data stored in the database through the data retrieval module. Compressing the data improves storage efficiency and significantly increases the amount of data that can be stored in a limited space. The system can hold up to 6.1 PB of data, and the database can store up to 12 million metadata records. The invention can be applied to the storage and query of gene sequencing data.

Description

Large-scale gene sequencing data storage and query system
Technical Field
The invention relates to the technical field of storage and query of gene sequencing data, in particular to a large-scale gene sequencing data storage and query system.
Background
Conventional data storage systems mainly follow three architectures: DAS, NAS and SAN. As data volumes have grown, these conventional systems can no longer meet storage demands, and distributed storage systems have emerged in response. In recent years, cloud storage systems built on distributed architectures have developed rapidly because they suit many new application scenarios.
Large-scale genome sequencing projects typically produce PB-scale sequencing data, which must be recorded in some format on a storage medium internal or external to a computer. Fast storage and retrieval of this gene sequencing data is therefore an important problem, as are the security of the stored data and the ability to query it. However, existing data storage methods neither store gene sequencing data efficiently enough nor fit enough of it into a limited space, so the amount of gene sequencing data that can be stored is severely limited by the available storage space.
Disclosure of Invention
The invention aims to solve the problems that, in existing methods, the amount of gene sequencing data that can be stored is limited by the available storage space and data storage efficiency is low, and provides a system that can efficiently store large-scale gene sequencing data and quickly query it.
The technical scheme adopted by the invention to solve these technical problems is as follows:
The system comprises a user login module, a data compression module, a file storage module, an interface display module, a database, a data uploading module, a data retrieval module and a data approval module, wherein:
The user login module is used by a user to enter real-name authentication information, which is matched against the user information stored in the database; if the match succeeds, the user is logged in; otherwise, the real-name authentication information must be entered again for login verification;
The data uploading module is used by a logged-in user to upload gene sequencing data, and places the uploaded gene sequencing data in the cache area;
The user confirms through the interface display module whether the gene sequencing data in the cache area is correct; if it is correct, the user sends a confirmation instruction to the cache area and the gene sequencing data in the cache area is stored into the file storage module; if it is not correct, the gene sequencing data in the cache area is deleted;
The file storage module sends the stored gene sequencing data to the data compression module;
The data compression module compresses the gene sequencing data, uploads the compressed sequencing data file to the hierarchical directory of the file storage module, deletes the corresponding original gene sequencing data from the file storage module, and updates the database;
The data approval module is used by an administrator to log in to the system and audit the sequencing data files uploaded by users; if a file passes the audit, its audit status in the database is updated and the file is stored permanently; sequencing data files in the database that fail the audit are deleted;
The data retrieval module is used by a user to retrieve sequencing data files in the database.
Further, the user information is stored in the database in MD5-encrypted format.
Further, the ways in which a user uploads gene sequencing data include single-sample upload and batch-sample upload.
Further, the system also comprises a template uploading module; for a batch-sample upload, the user fills the information for each single sample in the batch into a template, and the completed template is uploaded to the data uploading module through the template uploading module.
Further, the template into which the user fills the single-sample information of the batch is in Excel format.
Further, the format of the compressed sequencing data file is BAM or VCF.
Further, the compressed sequencing data files cover 7 omics types: genomics, transcriptomics, proteomics, metabolomics, microbiomics, phenomics and exposomics data.
Further, the hierarchy of the hierarchical directory in the file storage module is user name, project name, experiment name, sample name and file name.
Further, the data compression module compresses the gene sequencing data as follows:
Step 1, aligning the gene sequencing data to be compressed to a reference genome, and marking all variant sites;
Step 2, establishing two one-dimensional arrays that respectively record the start position and the end position of the sequence fragment containing the gene sequencing data currently being processed;
Step 3, constructing a variation position matrix from the marked variant sites and their positions within the sequence fragments, wherein the dimension of the variation position matrix is M × N, M is the number of actual samples contained in the gene sequencing data to be compressed, and N is the length of the reference genome;
Each element of the variation position matrix is 0 or 1: 0 indicates that the base of the gene sequencing data at that site is the same as the corresponding base of the reference genome, and 1 indicates that it is different;
Starting from the first column of the variation position matrix, the row strings at that column are sorted in reverse-prefix lexicographic order, and so on for each column from left to right until every column of the matrix has been traversed, yielding a new variation position matrix;
Step 4, compressing the new variation position matrix using run-length encoding.
The beneficial effects of the invention are as follows:
The invention provides a large-scale gene sequencing data storage and query system that automatically classifies gene sequencing data, stores data through the file storage module and the database, and retrieves the data stored in the database through the data retrieval module. Compressing the data improves storage efficiency and significantly increases the amount of data that can be stored in a limited space.
The system can hold up to 6.1 PB of data, and the database can store up to 12 million metadata records. Up to 51 types of labels can be searched simultaneously, 500,000 records can be displayed at the front end through paging, and a multi-condition search takes no more than 10 s.
Drawings
FIG. 1 is a flow chart of a large-scale gene sequencing data storage and query system of the present invention;
FIG. 2 is a data interaction flow diagram;
FIG. 3 is a diagram of a data integration engine;
FIG. 4 is a user login flow chart;
FIG. 5 is a flow chart of multi-omics data upload;
FIG. 6 is a flow chart of data compression;
FIG. 7 is a data retrieval flow chart.
Detailed Description
Embodiment 1: this embodiment is described with reference to FIGS. 1, 4 and 5. The large-scale gene sequencing data storage and query system of this embodiment comprises a user login module, a data compression module, a file storage module, an interface display module, a database, a data uploading module, a data retrieval module and a data approval module, wherein:
The user login module is used by a user to enter real-name authentication information, which is matched against the user information stored in the database; if the match succeeds, the user is logged in; otherwise, the real-name authentication information must be entered again for login verification; login and registration are performed before the system is used to upload or retrieve data;
The data uploading module is used by a logged-in user to upload gene sequencing data, and places the uploaded gene sequencing data in the cache area;
The user confirms through the interface display module whether the gene sequencing data in the cache area is correct; if it is correct, the user sends a confirmation instruction to the cache area and the gene sequencing data in the cache area is stored into the file storage module; if it is not correct, the gene sequencing data in the cache area is deleted;
The file storage module sends the stored gene sequencing data to the data compression module;
The data compression module compresses the gene sequencing data, uploads the compressed sequencing data file to the hierarchical directory of the file storage module, deletes the corresponding original gene sequencing data from the file storage module, and updates the database;
The data approval module is used by an administrator to log in to the system and audit the sequencing data files uploaded by users; if a file passes the audit, its audit status in the database is updated and the file is stored permanently; sequencing data files in the database that fail the audit are deleted;
To ensure that the file data uploaded by users is genuine and valid, an administrator must audit it. The administrator searches the file list, filters out the list of data records that have not yet been audited, locates each file through its file path, and checks whether the file format and content are correct. For data that passes the audit, a data update function is provided and the audit status of the data in the database is updated. For data that fails the audit, the status is updated and the corresponding file is removed from the file system;
The data retrieval module is used for retrieving the sequencing data file in the database by a user.
As shown in FIG. 7, the data retrieval module provides fuzzy retrieval and multi-information joint retrieval. For fuzzy retrieval, the user enters only a keyword or part of a keyword, which is submitted to the back-end interface of the data retrieval module; a keyword index in the database speeds up the search, and the resulting data list is returned to the user. For multi-information joint retrieval, the system groups labels into sample information, experiment information, file type, multi-omics type and sequencing type; each category provides several classification labels, displayed to the user as check boxes, and the user can select multiple labels under different categories. The system packages the selected search criteria together with the text entered in the search box into JSON-format data, submits it to the retrieval interface, and returns the results to the user as a list. The data retrieval module also provides a list export function: the user selects the column labels to download and exports an Excel file.
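The packaging of checked labels and search-box text into JSON can be sketched as follows. This is a minimal illustration only: the function name `build_search_payload` and the field names `keyword` and `filters` are hypothetical and do not come from the patent, which does not specify the payload layout.

```python
import json


def build_search_payload(keyword, selected_labels):
    """Package the search-box text and the checked labels into one JSON string.

    `selected_labels` maps a category name to the labels ticked under it.
    All field names are hypothetical; the patent does not define them.
    """
    # Drop categories with nothing checked so the back end only sees active filters.
    filters = {cat: sorted(labels) for cat, labels in selected_labels.items() if labels}
    return json.dumps({"keyword": keyword.strip(), "filters": filters}, ensure_ascii=False)


# Example: a keyword plus labels chosen under two of three categories.
payload = build_search_payload(
    " BRCA1 ",
    {"file_type": ["BAM", "VCF"], "omics_type": ["genomics"], "sequencing_type": []},
)
```

Sending one self-describing object keeps the retrieval interface independent of how many categories the front end displays.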
Before data is uploaded or retrieved, the system performs user login and registration. A user uploading data must provide real-name authentication information, which is matched against the user information stored in MD5-encrypted format; this improves security.
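A minimal sketch of the matching step, assuming the database keeps an MD5 digest of each user's authentication information. Because MD5 produces a one-way digest, matching amounts to hashing the entered information and comparing digests. The `USER_TABLE` dictionary, the `login` function and the sample credential are hypothetical stand-ins for the real user table.

```python
import hashlib
import hmac


def md5_hex(text: str) -> str:
    """Return the hexadecimal MD5 digest of a UTF-8 string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Hypothetical stand-in for the user table: user name -> stored MD5 digest.
USER_TABLE = {"alice": md5_hex("Zhang San|example-id-0001")}


def login(username: str, real_name_info: str) -> bool:
    """Hash the entered information and compare it with the stored digest."""
    stored = USER_TABLE.get(username)
    if stored is None:
        return False
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(stored, md5_hex(real_name_info))
```

On a failed match the caller would prompt the user to enter the authentication information again, as the login module describes.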
The biological information data storage method and the data visualization program of the invention write biological information data into the database by category and support associated retrieval and visualization of the data.
Embodiment 2: this embodiment differs from Embodiment 1 in that the user information is stored in the database in MD5-encrypted format, so as to improve security.
Other steps and parameters are the same as in Embodiment 1.
Embodiment 3: this embodiment differs from Embodiment 1 or 2 in that the ways in which a user uploads gene sequencing data include single-sample upload and batch-sample upload.
For a single-sample upload, the uploader provides their basic information and fills in the data meta-information, that is, the auxiliary information about the uploaded sample, which is used to associate the sample with its sequencing data file.
Other steps and parameters are the same as in Embodiment 1 or 2.
Embodiment 4: this embodiment differs from one of Embodiments 1 to 3 in that the system further comprises a template uploading module; for a batch-sample upload, the user fills the information for each single sample in the batch into a template, and the completed template is uploaded to the data uploading module through the template uploading module.
For batch uploads, the system provides an upload template (Excel format). The number of samples and the number of experiments in the template are generated automatically from parameters the user fills in; the template holds the information for multiple single samples, and once it is completed the user uploads it through the template upload function. The sequencing data files are then uploaded to the designated location on the server; after a successful upload the front-end page refreshes and displays the list of uploaded file paths, which is submitted to the server interface once the user has checked it. The system parses the Excel file and writes it to the database, and echoes the data back so the user can check whether the uploaded Excel file contains errors.
Other steps and parameters are the same as in one of Embodiments 1 to 3.
Embodiment 5: this embodiment differs from one of Embodiments 1 to 4 in that the template into which the user fills the single-sample information of the batch is in Excel format.
Other steps and parameters are the same as in one of Embodiments 1 to 4.
Embodiment 6: this embodiment differs from one of Embodiments 1 to 5 in that the format of the compressed sequencing data file is BAM or VCF.
Other steps and parameters are the same as in one of Embodiments 1 to 5.
Embodiment 7: this embodiment differs from one of Embodiments 1 to 6 in that the compressed sequencing data files cover 7 omics types: genomics, transcriptomics, proteomics, metabolomics, microbiomics, phenomics and exposomics data.
Other steps and parameters are the same as in one of Embodiments 1 to 6.
Embodiment 8: this embodiment differs from one of Embodiments 1 to 7 in that the hierarchy of the hierarchical directory in the file storage module is user name, project name, experiment name, sample name and file name.
The temporary file directory that is created is associated with the id of each registered user. When a user uploads files, the meta-information corresponding to the files is uploaded as well, including detailed information about the data samples, basic information about the sample's user, sequencing information and so on. The user name is the name of the user currently logged in to the system; the project name identifies the experimental objective, and one project can contain multiple experiments; each experiment in turn contains multiple pieces of sample information, sequencing platform information, data processing information and other information; each sample can contain one or more result files.
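The five-level hierarchy (user name, project name, experiment name, sample name, file name) maps naturally onto a path. The sketch below is illustrative only: the function name `sample_file_path`, the storage root `/data/store` and the example names are assumptions, not details given in the patent.

```python
from pathlib import PurePosixPath


def sample_file_path(root, user, project, experiment, sample, filename):
    """Build root/user/project/experiment/sample/filename for a stored file."""
    parts = [user, project, experiment, sample, filename]
    for part in parts:
        # Reject empty names and anything that could escape the hierarchy.
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"invalid path component: {part!r}")
    return PurePosixPath(root).joinpath(*parts)


# Example with hypothetical names.
path = sample_file_path("/data/store", "alice", "projA", "exp01", "S001", "S001.bam")
```

Validating each component keeps one user's upload from being written into another user's directory.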
Other steps and parameters are the same as in one of Embodiments 1 to 7.
Embodiment 9: this embodiment is described with reference to FIG. 6. This embodiment differs from one of Embodiments 1 to 8 in that the data compression module compresses the gene sequencing data as follows:
Step 1, aligning the gene sequencing data to be compressed to a reference genome, and marking all variant sites;
Step 2, establishing two one-dimensional arrays that respectively record the start position and the end position of the sequence fragment containing the gene sequencing data currently being processed; the recorded start and end positions are used mainly for subsequent restoration and decompression of the compressed data;
Step 3, constructing a variation position matrix from the marked variant sites and their positions within the sequence fragments, wherein the dimension of the variation position matrix is M × N, M is the number of actual samples contained in the gene sequencing data to be compressed, and N is the length of the reference genome;
Each element of the variation position matrix is 0 or 1: 0 indicates that the base of the gene sequencing data at that site is the same as the corresponding base of the reference genome, and 1 indicates that it is different;
Starting from the first column of the variation position matrix, the row strings at that column are sorted in reverse-prefix lexicographic order, and so on for each column from left to right until every column of the matrix has been traversed, yielding a new variation position matrix;
Step 4, compressing the new variation position matrix using run-length encoding.
To improve the storage efficiency of the file system, that is, to store more data in a limited space, the system adopts a resequencing compression (SRVZip) algorithm based on a self-indexing structure, combined with classical byte coding techniques, to achieve lossless data compression with a high compression ratio, which relieves some of the pressure on data storage and data transmission. The compression method (SRVZip algorithm) of this embodiment is further explained as follows:
Using the human reference genome, sample sequencing data is compressed for storage by recording and compressing the sequence differences between the sample and the reference genome. All sequencing reads are aligned to the reference genome with the deBGA, BWA or Bowtie2 genome sequence alignment software; the CIGAR information in all alignment results is then traversed and all variant sites are marked.
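Traversing CIGAR information to mark variant sites might look like the sketch below. It is an illustration under stated assumptions, not the patent's implementation: the function name `variant_sites` is hypothetical, mismatches inside `M` blocks are found by comparing the read with the reference directly, and an insertion is anchored at the reference base that follows the inserted bases (a convention chosen for this sketch).

```python
import re

# One CIGAR element is a length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def variant_sites(ref, read, pos, cigar):
    """Return sorted 0-based reference positions where the read differs from ref.

    `pos` is the 0-based reference position at which the alignment starts.
    """
    sites = set()
    r = pos  # cursor on the reference
    q = 0    # cursor on the read
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":
            # Aligned block: compare base by base to find substitutions.
            for k in range(n):
                if read[q + k] != ref[r + k]:
                    sites.add(r + k)
            r += n
            q += n
        elif op == "I":
            sites.add(r)  # insertion consumes read bases only
            q += n
        elif op == "D":
            sites.update(range(r, r + n))  # deleted reference bases
            r += n
        elif op == "N":
            r += n  # skipped reference region, not a variant
        elif op == "S":
            q += n  # soft clip consumes read bases only
        # H and P consume neither sequence.
    return sorted(sites)
```

For a reference `ACGTACGTAC` and a read `GTTCGT` aligned at position 2 with CIGAR `6M`, the single substitution falls on reference position 4.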
The start and end positions of the sequence fragments (reads) are recorded in two one-dimensional arrays, so that reads can be dynamically added to and removed from the set being encoded during PBWT encoding. The PBWT step introduces two one-dimensional arrays, start[M] and end[M], which determine whether the read corresponding to the site currently being processed needs to be dynamically added to the processing queue.
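The role of the two arrays can be illustrated with a short sketch. It assumes half-open intervals, so read i covers sites start[i] up to but not including end[i]; the patent does not state the interval convention, and the function name `active_reads` is hypothetical.

```python
def active_reads(start, end, site):
    """Return the indices of reads whose interval [start, end) covers `site`."""
    return [i for i, (s, e) in enumerate(zip(start, end)) if s <= site < e]


# Three reads covering sites 0-3, 3-7 and 5-8 respectively.
start = [0, 3, 5]
end = [4, 8, 9]
```

Sweeping `site` from left to right, a read joins the processing queue when `site` reaches its start and leaves once `site` reaches its end.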
An original alignment variation position matrix is constructed with M samples (the actual number of samples in the genome project) and N sites (the reference genome length). With differential encoding against the reference genome, 0 indicates that the base at a site is the same as the corresponding base of the reference genome, and 1 indicates that it is different. The PBWT matrix transformation is then performed: at each position the row strings are sorted in reverse-prefix lexicographic order relative to the current position, traversing the matrix from left to right until the last position is reached.
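A minimal sketch of this column-by-column reordering, written in the style of the positional Burrows-Wheeler transform. It assumes every row spans all N sites (the start/end bookkeeping is left out) and may differ in detail from the variant used in the patent. The function name `pbwt_transform` is hypothetical.

```python
def pbwt_transform(matrix):
    """Reorder each column of a binary M x N matrix by reverse-prefix order.

    Returns the list of permuted columns and the final row order.
    """
    m = len(matrix)
    n = len(matrix[0]) if m else 0
    order = list(range(m))  # before the first column, rows keep their input order
    columns = []
    for k in range(n):
        # Emit column k with rows arranged by their reversed prefixes 0..k-1.
        columns.append([matrix[i][k] for i in order])
        # A stable partition on column k yields the order for column k+1:
        # rows with 0 first, then rows with 1, each keeping its relative order.
        zeros = [i for i in order if matrix[i][k] == 0]
        ones = [i for i in order if matrix[i][k] == 1]
        order = zeros + ones
    return columns, order


# Four samples, two distinct variant patterns interleaved.
matrix = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]
columns, order = pbwt_transform(matrix)
```

The stable partition is what makes the sort cheap: each column costs O(M) instead of a full string sort, and rows that share a recent history end up adjacent.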
After the PBWT matrix transformation is complete, runs of consecutive 1s or 0s appear in the vertical direction, so run-length encoding can be used for compression. Because the alphabet of genome data is small, the specific variant sequence information is compressed efficiently using classical Huffman coding.
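Run-length encoding of one transformed column can be sketched as below. This is a generic illustration of the technique rather than the patent's byte-level format, and the Huffman stage is not shown. The function names are hypothetical.

```python
from itertools import groupby


def rle_encode(bits):
    """Collapse a sequence into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(bits)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


# A column with long runs, as produced after the reordering step.
column = [1, 1, 0, 0, 0, 1]
encoded = rle_encode(column)
```

The encoding is lossless: decoding the pairs reproduces the column exactly, and the longer the runs, the fewer pairs are needed.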
For the quality-score information in the sequencing data, sparsification and homogenization are applied first, so that differing quality values are converted to the same value and form runs of identical numbers that can be compressed with run-length encoding. The remaining differentiated quality values are also compressed with run-length encoding.
In this embodiment, lossless compression with a high compression ratio is applied to the gene sequencing data, which relieves some of the pressure on data storage and data transmission, improves the storage efficiency of the file system, and allows more data to be stored in a limited space.
Other steps and parameters are the same as in one of Embodiments 1 to 8.
As shown in FIGS. 2 and 3, the storage module is a file system on the server. When multi-omics data is uploaded, it is first placed in the cache area; after the user has checked it and found no errors, a confirmation instruction is sent to the server's cache area through the back-end interface, the data in the cache area is moved to the permanent storage area, and the database is updated at the same time. The file system can hold up to 6.1 PB of data. Users access the application interface through a PC browser to query, upload and store data.
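The cache-then-confirm flow can be sketched as a single function. This is a simplified illustration: the function name `resolve_cached_upload` and the directory layout are assumptions, and the accompanying database update is not modelled.

```python
import shutil
from pathlib import Path


def resolve_cached_upload(cache_file, permanent_dir, confirmed):
    """Move a cached upload into permanent storage, or delete it if rejected.

    Returns the new path when the upload is confirmed, otherwise None.
    """
    cache_file = Path(cache_file)
    if not confirmed:
        # The user found an error: discard the cached copy.
        cache_file.unlink(missing_ok=True)
        return None
    permanent_dir = Path(permanent_dir)
    permanent_dir.mkdir(parents=True, exist_ok=True)
    target = permanent_dir / cache_file.name
    shutil.move(str(cache_file), str(target))
    return target
```

Keeping unconfirmed files in a separate cache area means a rejected upload never touches the permanent store.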
When using the system, the user first logs in or registers. As shown in FIG. 4, login is divided into administrator login and ordinary-user login, and the two roles have different permissions. An ordinary user can upload multi-omics data files and metadata to the system and can also retrieve data. An administrator can audit the uploaded data and decide whether it passes the audit and becomes available for users to retrieve.
After logging in successfully, the user can upload multi-omics files and metadata to the system. As shown in FIG. 5, the upload flow covers multi-sample multi-omics data upload and single-sample multi-omics data upload: for a single sample the user enters the information directly, while for a multi-sample upload the user fills in parameters, downloads a template and enters the data information in the template. Once the information is complete, the files are uploaded through the file transfer system to the current user's directory in the file system; after the transfer succeeds the user checks whether the files are correct, and the upload is complete once they are confirmed. For successfully uploaded files, to further improve the space utilization of the file system, the system applies the resequencing compression algorithm (SRVZip) based on a self-indexing structure to further compress the gene sequencing data, writes the compressed file back into the hierarchical directory, and deletes the original file, thereby improving the storage efficiency of the file system, as shown in FIG. 6.
For retrieval, the user can choose single-condition search or multi-condition joint search, as shown in FIG. 7. For a single condition, the user only enters a keyword in the input box; after submission the keyword is matched against the database through the interface, and a result list is returned on a successful match. For multi-condition joint search, the user can select multiple labels under several different categories; the selected search fields are combined and submitted to the interface, which performs a joint query over multiple tables in the database and returns a result list.
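A joint query over several tables can be sketched with an in-memory SQLite database. The schema, table names and sample rows below are hypothetical; the patent names neither the database engine nor its tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical two-table schema: samples and the files that belong to them.
conn.executescript("""
CREATE TABLE sample (id INTEGER PRIMARY KEY, name TEXT, omics_type TEXT);
CREATE TABLE seq_file (id INTEGER PRIMARY KEY, sample_id INTEGER, path TEXT, file_type TEXT);
INSERT INTO sample VALUES (1, 'S001', 'genomics'), (2, 'S002', 'transcriptomics');
INSERT INTO seq_file VALUES (1, 1, 'S001.bam', 'BAM'), (2, 1, 'S001.vcf', 'VCF'),
                            (3, 2, 'S002.bam', 'BAM');
""")


def joint_search(conn, omics_types, file_types):
    """Return (sample name, file path) rows matching labels from two categories.

    Both label lists are assumed to be non-empty.
    """
    # Only the number of placeholders is interpolated; the values are bound.
    omics_marks = ",".join("?" * len(omics_types))
    file_marks = ",".join("?" * len(file_types))
    sql = (
        "SELECT s.name, f.path FROM sample s "
        "JOIN seq_file f ON f.sample_id = s.id "
        f"WHERE s.omics_type IN ({omics_marks}) AND f.file_type IN ({file_marks}) "
        "ORDER BY f.id"
    )
    return conn.execute(sql, [*omics_types, *file_types]).fetchall()


rows = joint_search(conn, ["genomics"], ["BAM", "VCF"])
```

Labels within one category are combined with OR (the `IN` list), while different categories are combined with AND, which matches the check-box behaviour described above.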
The invention provides an automatic classification, storage and retrieval system for human genetic data, and establishes a comprehensive large-scale genome data management system with automatic classification, storage, retrieval and visualization of human genetic data. In addition, a multi-omics genetic data quality evaluation method and a sensitive genetic data identification and evaluation system are established and linked with a biological data network transmission monitoring system, so that behaviors such as illegal export, malicious forgery and expansion of genetic data are effectively screened.
The above examples only describe the calculation model and calculation flow of the invention in detail and do not limit its embodiments. Other variations and modifications based on the above description will be apparent to those of ordinary skill in the art; the embodiments cannot all be listed exhaustively here, and all such variations fall within the scope of the invention.

Claims (8)

1. A large-scale gene sequencing data storage and query system, characterized by comprising a user login module, a data compression module, a file storage module, an interface display module, a database, a data uploading module, a data retrieval module and a data approval module, wherein:
the user login module is used by a user to enter real-name authentication information, which is matched against the user information stored in the database; if the match succeeds, the user is logged in; otherwise, the real-name authentication information must be entered again for login verification;
the data uploading module is used by a logged-in user to upload gene sequencing data, and places the uploaded gene sequencing data in a cache area;
the user confirms through the interface display module whether the gene sequencing data in the cache area are correct; if so, the user sends a confirmation instruction to the cache area and the gene sequencing data in the cache area are stored in the file storage module; if not, the gene sequencing data in the cache area are deleted;
the file storage module sends the stored gene sequencing data to the data compression module;
the data compression module compresses the gene sequencing data, uploads the compressed sequencing data file to the hierarchical directory of the file storage module, deletes the corresponding original gene sequencing data from the file storage module, and updates the database;
the data compression module compresses the gene sequencing data as follows:
step 1, comparing the gene sequencing data to be compressed with a reference genome and marking all variation sites;
step 2, establishing two one-dimensional arrays that respectively record the start position and the end position of the sequence segment in which the currently processed gene sequencing data are located;
step 3, constructing a variation position matrix from the marked variation sites and their positions in the sequence segments, wherein the dimension of the variation position matrix is M × N, M being the number of actual samples contained in the gene sequencing data to be compressed and N being the length of the reference genome;
each element of the variation position matrix is 0 or 1, where 0 indicates that the base of the gene sequencing data at that site is the same as the corresponding base of the reference genome, and 1 indicates that it differs;
starting from the first column of the variation position matrix, the row strings corresponding to that column are sorted in reverse-prefix lexicographic order, and so on for each column of the variation position matrix from left to right, to obtain a new variation position matrix;
step 4, compressing the new variation position matrix using run-length encoding;
the data approval module is used by an administrator to log in to the system and audit the sequencing data files uploaded to the database by users; if a sequencing data file passes the audit, its audit status in the database is updated and the file is stored permanently; if it does not pass, the file is deleted from the database;
the data retrieval module is used by a user to retrieve sequencing data files in the database.
2. The large-scale gene sequencing data storage and query system of claim 1, wherein the user information is stored in the database in MD5-encrypted form.
3. The large-scale gene sequencing data storage and query system of claim 2, wherein the ways in which a user uploads gene sequencing data comprise single-sample upload and batch-sample upload.
4. The large-scale gene sequencing data storage and query system of claim 3, further comprising a template upload module, wherein, for batch-sample upload, the user fills the information of each single sample in the batch into a template, and the completed template is uploaded to the data uploading module by the template upload module.
5. The large-scale gene sequencing data storage and query system of claim 4, wherein the template into which the user fills the single-sample information of the batch is in Excel format.
6. The large-scale gene sequencing data storage and query system of claim 5, wherein the compressed sequencing data file is in BAM or VCF format.
7. The large-scale gene sequencing data storage and query system of claim 6, wherein the compressed sequencing data files comprise files of 7 omics types, namely genomics, transcriptomics, proteomics, metabolomics, microbiomics, phenomics and exposomics data.
8. The large-scale gene sequencing data storage and query system of claim 7, wherein the hierarchy of the hierarchical directory in the file storage module is user name → project name → experiment name → sample name → file name.
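
As an illustration only, steps 1, 3 and 4 of the compression process in claim 1 can be sketched in Python as follows. The sketch rests on several assumptions that the claims do not confirm: every sample is already aligned and has the same length as the reference; the reverse-prefix ordering is read in the style of a positional Burrows-Wheeler transform, where column k is written out with rows ordered by their reversed prefix over columns k-1, ..., 0; and run-length encoding is applied column by column. The start/end position arrays of step 2 and the decompression path are omitted. All function names are hypothetical.

```python
from itertools import groupby

def variation_matrix(samples, reference):
    # Steps 1 and 3: 1 where a sample base differs from the reference, else 0.
    return [[int(base != ref) for base, ref in zip(sample, reference)]
            for sample in samples]

def reverse_prefix_reorder(matrix):
    # Assumed reading of the reverse-prefix sort: before emitting column k,
    # rows are ordered by their reversed prefix (columns k-1 down to 0).
    m = len(matrix)
    n = len(matrix[0]) if m else 0
    order = list(range(m))              # row order for the empty prefix
    out = [[0] * n for _ in range(m)]
    for k in range(n):
        for pos, row in enumerate(order):
            out[pos][k] = matrix[row][k]
        # A stable partition on column k yields the order for column k + 1.
        zeros = [r for r in order if matrix[r][k] == 0]
        ones = [r for r in order if matrix[r][k] == 1]
        order = zeros + ones
    return out

def run_length_encode(bits):
    # Step 4: collapse each run of equal values into (value, run_length).
    return [(value, len(list(group))) for value, group in groupby(bits)]

def compress(samples, reference):
    reordered = reverse_prefix_reorder(variation_matrix(samples, reference))
    # Assumption: encode per column, where the reordering clusters equal values.
    return [run_length_encode(column) for column in zip(*reordered)]

reference = "ACGT"
samples = ["ACGT", "AGGA", "ACGT", "AGGA"]
for column_runs in compress(samples, reference):
    print(column_runs)
```

In this toy input, samples 1 and 3 share variants at sites 1 and 3. After the reordering, the last column reads 0, 0, 1, 1 instead of 0, 1, 0, 1, so it encodes as two runs rather than four, which is the effect the sort is meant to give run-length encoding.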
CN202111191995.0A 2021-10-13 2021-10-13 Large-scale gene sequencing data storage and query system Active CN113901006B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111191995.0A CN113901006B (en) 2021-10-13 2021-10-13 Large-scale gene sequencing data storage and query system

Publications (2)

Publication Number Publication Date
CN113901006A CN113901006A (en) 2022-01-07
CN113901006B true CN113901006B (en) 2024-05-24

Family

ID=79191702

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111191995.0A Active CN113901006B (en) 2021-10-13 2021-10-13 Large-scale gene sequencing data storage and query system

Country Status (1)

Country Link
CN (1) CN113901006B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115270169B (en) * 2022-05-18 2023-06-13 蔓之研(上海)生物科技有限公司 Decompression method and system for gene data
CN115440305A (en) * 2022-08-29 2022-12-06 新疆碳智干细胞库有限公司 Human genetic resource gene data management system and method
CN115391284B (en) * 2022-10-31 2023-02-03 四川大学华西医院 Method, system and computer readable storage medium for quickly identifying gene data file
CN117033735B (en) * 2023-10-08 2024-01-16 之江实验室 Gene data retrieval method, device, computer equipment and storage medium
CN118072828B (en) * 2024-04-22 2024-07-19 北京百奥利盟软件技术有限公司 Management method, system and storage medium for multi-study experimental process data

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462211A (en) * 2014-11-04 2015-03-25 北京诺禾致源生物信息科技有限公司 Re-sequencing data processing method and processing device
CN111625509A (en) * 2020-05-26 2020-09-04 福州数据技术研究院有限公司 Lossless compression method for deep sequencing gene sequence data file

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103336916B (en) * 2013-07-05 2016-04-06 中国科学院数学与系统科学研究院 A kind of sequencing sequence mapping method and system
WO2016081712A1 (en) * 2014-11-19 2016-05-26 Bigdatabio, Llc Systems and methods for genomic manipulations and analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design of efficient massive-data query based on multi-level columnar indexes; 杨淙钧; 艾中良; 刘忠麟; 李常宝; Software (软件); 20160315(03); full text *

Also Published As

Publication number Publication date
CN113901006A (en) 2022-01-07

Similar Documents

Publication Publication Date Title
CN113901006B (en) Large-scale gene sequencing data storage and query system
KR102496954B1 (en) Lossless data reduction by deriving the data from the underlying data elements present in the content-associative sheaves.
US10713589B1 (en) Consistent sort-based record-level shuffling of machine learning data
US8799291B2 (en) Forensic index method and apparatus by distributed processing
US20050210054A1 (en) Information management system
Roussev et al. Multi-resolution similarity hashing
US20070174238A1 (en) Indexing and searching numeric ranges
CN106233259A (en) The many storage data from generation to generation of retrieval in decentralized storage networks
Doan et al. Integration of iot streaming data with efficient indexing and storage optimization
US20050219076A1 (en) Information management system
US11916576B2 (en) System and method for effective compression, representation and decompression of diverse tabulated data
JP2008542865A (en) Digital proof bag
WO2021231255A1 (en) Exploiting locality of prime data for efficient retrieval of data that has been losslessly reduced using a prime data sieve
CN108475508B (en) Simplification of audio data and data stored in block processing storage system
US10437825B2 (en) Optimized data condenser and method
CN111460452B (en) An Android malware detection method based on frequency fingerprint extraction
CN107193996B (en) Similar medical record matching and retrieving system
CN115729465A (en) Document decoupling and synthesizing system based on paragraph small file storage
KR20190062551A (en) Method and apparatus for accessing bioinformatics data structured as an access unit
CN114556318A (en) Customizable delimited text compression framework
US20230394015A1 (en) List-based data storage for data search
KR102592785B1 (en) Method, computer device, and computer program to provide individual data retrieval service
CN112286874B (en) Time-based file management method
US20240178860A1 (en) System and method for effective compression representation and decompression of diverse tabulated data
Khatri et al. A manual approach for multimedia file carving

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant