CN114566212A - Method and device for carrying Hi-C genome larger than 10G - Google Patents

Publication number
CN114566212A
Authority
CN
China
Prior art keywords
files
file
assembly
independent
hic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210463242.9A
Other languages
Chinese (zh)
Other versions
CN114566212B (en)
Inventor
赵勇
周勋
王龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Novogene Biological Information Technology Co ltd
Original Assignee
Tianjin Novogene Biological Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Novogene Biological Information Technology Co ltd
Priority to CN202210463242.9A
Publication of CN114566212A
Application granted
Publication of CN114566212B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application provides a method and a device for Hi-C mounting of genomes larger than 10G. The method comprises: acquiring an original hic file and an original assembly file from the Hi-C data of a genome larger than 10G; splitting the original assembly file into a plurality of independent assembly files, and obtaining a corresponding plurality of independent hic files from them; obtaining a plurality of adjusted independent assembly files from the independent assembly files and the independent hic files; and merging the adjusted independent assembly files to obtain a genome assembled at the chromosome level. The method effectively reduces the software delay incurred when the hic file of a large genome is manually adjusted, without sacrificing the accuracy of the hic interactions.

Description

Method and device for Hi-C mounting of genomes larger than 10G
Technical Field
The application relates to the field of genome assembly, and in particular to a method and a device for Hi-C mounting of genomes larger than 10G.
Background
Genome assembly is generally divided into second-generation sequencing data assembly and third-generation sequencing data assembly. The assembly software commonly used for second-generation sequencing data is SOAPdenovo, which combines small-fragment and large-fragment data to produce a scaffold-level genome; the assembly software commonly used for third-generation sequencing data is canu or falcon, which produces a contig-level genome. Neither of these assembly approaches can assemble the genome to the chromosome level.
Hi-C (high-throughput chromosome conformation capture) technology exploits the principle that interaction strength within a chromosome is far greater than interaction strength between chromosomes. Tissue is cross-linked and fixed with formaldehyde, the genome is digested with a specific restriction enzyme, the ends are biotin-labeled and repaired, the fragments are ligated and then sheared, and the biotin-labeled fragments are captured with magnetic beads for high-throughput sequencing. The sequencing data are combined with the contig-level or scaffold-level genome and mounted using 3d-dna software, which generates an hic file and an assembly file; the chromosome-level genome is finally obtained after manual adjustment in juicebox.
The hic file is a binary file generated by 3d-dna software; its content is the hic interaction matrix. The default resolutions of the hic file generated by 3d-dna are "2500000, 1000000, 500000, 250000, 100000, 50000, 25000, 10000, 5000, 1000".
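As a rough illustration of why these resolutions matter for large genomes, the sketch below counts how many bins one axis of the interaction matrix would have at each default resolution. It assumes that "10G" means 10 billion bases and that resolutions are in base pairs; the function name `bins_per_axis` is hypothetical. The real matrix is stored sparsely, so the bin count is only a proxy for the workload, not a measurement.

```python
# Default 3d-dna resolutions quoted in the text (assumed to be in base pairs).
RESOLUTIONS = [2500000, 1000000, 500000, 250000, 100000,
               50000, 25000, 10000, 5000, 1000]


def bins_per_axis(genome_size_bp: int, resolution_bp: int) -> int:
    """Number of bins along one axis of the interaction matrix (ceiling division)."""
    return -(-genome_size_bp // resolution_bp)


# Assumption: "10G" is read as 10 billion bases.
genome_size = 10_000_000_000
for resolution in RESOLUTIONS:
    print(f"resolution {resolution:>8} bp -> {bins_per_axis(genome_size, resolution):>10} bins per axis")
```

At the finest resolution a 10G genome has ten million bins per axis, which gives a sense of how much more data the viewer must handle than for a small genome.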
The assembly file is a text file corresponding to the hic file; it records the length and position information of each contig or scaffold. The assembly file is divided into two parts. The first part consists of lines beginning with ">", each with 3 columns: the first column is the sequence name, prefixed with ">"; the second column is the sequence number, starting at 1; the third column is the sequence length. The second part consists of purely numeric lines with a variable number of columns; each line represents several contigs or scaffolds clustered into one large scaffold, the numbers in the line correspond to the sequence numbers in the first part, and a negative sign indicates that the sequence is reverse-complemented.
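The two-part layout described above can be read with a few lines of code. The following is a minimal sketch, assuming whitespace-separated columns exactly as described; the `Assembly` class, `parse_assembly` function and the sample contig names are illustrative and not part of 3d-dna.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Assembly:
    # sequence number -> (name, length), taken from the ">" lines
    sequences: Dict[int, Tuple[str, int]]
    # one list per numeric line; a negative number means reverse complement
    scaffolds: List[List[int]]


def parse_assembly(text: str) -> Assembly:
    """Parse the two parts of an assembly file as described in the text."""
    sequences: Dict[int, Tuple[str, int]] = {}
    scaffolds: List[List[int]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name, number, length = line[1:].split()
            sequences[int(number)] = (name, int(length))
        else:
            scaffolds.append([int(token) for token in line.split()])
    return Assembly(sequences, scaffolds)


# Hypothetical three-contig file: contigs 1 and 3 form one scaffold
# (contig 3 reverse-complemented), contig 2 stands alone.
example = """>ctg_a 1 5000
>ctg_b 2 3000
>ctg_c 3 1000
1 -3
2
"""
asm = parse_assembly(example)
print(asm.sequences)
print(asm.scaffolds)
```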
Juicebox is software written in Java for visualizing hic data. After an hic file is imported into juicebox, the hic heat map can be broken, flipped or moved according to interaction strength; the adjusted hic file yields a corresponding adjusted assembly file, from which the mounted chromosome-level genome is obtained.
However, the inventors found in their studies that, for a genome smaller than 10G, the hic file is small, the delay of each manual adjustment in juicebox is short, and the chromosome-level genome can be adjusted quickly; for a genome larger than 10G, the delay of each manual adjustment increases markedly, and the adjustment process often causes juicebox to crash. Reducing the interaction resolution shortens the adjustment delay, but sacrifices the accuracy of the adjustment.
Therefore, the mounting method for hic data in the assembly of genomes larger than 10G still needs to be improved, so as to reduce the software delay of manually adjusting the hic file of a large genome.
Disclosure of Invention
The main purpose of the present application is to provide a method and a device for Hi-C mounting of genomes larger than 10G, so as to reduce the software delay when the hic file of a large genome is manually adjusted.
To achieve the above object, according to one aspect of the present application, there is provided a method for Hi-C mounting of a genome larger than 10G, the method comprising: acquiring an original hic file and an original assembly file of the Hi-C data of a genome larger than 10G; splitting the original assembly file into a plurality of independent assembly files, and obtaining a corresponding plurality of independent hic files from the independent assembly files; obtaining a plurality of adjusted independent assembly files from the independent assembly files and the independent hic files, respectively; and merging the adjusted independent assembly files to obtain a genome assembled at the chromosome level.
Further, acquiring the original hic file and the original assembly file of the Hi-C data of a genome larger than 10G includes: converting the contig-level genome larger than 10G and the Hi-C data into a non-redundant txt file recognizable by 3d-dna software; and inputting that non-redundant txt file, together with the contig-level genome, into the 3d-dna software to obtain the original hic file and the original assembly file.
Further, splitting the original assembly file into a plurality of independent assembly files, and obtaining a corresponding plurality of independent hic files from the independent assembly files, includes: inputting the original hic file and the original assembly file into juicebox software and manually adjusting the interaction relationships; dividing the manually adjusted interaction map into a plurality of clusters according to the size of the contig-level genome and the number of chromosomes, to obtain a plurality of adjusted assembly files; renumbering the contigs in each adjusted assembly file from 1, following the order of their sequence numbers, to obtain the independent assembly files; and inputting the independent assembly files into juicebox_tools software to generate the corresponding independent hic files.
Further, obtaining a plurality of adjusted independent assembly files from the independent assembly files and the independent hic files, respectively, includes: importing the independent hic files and the corresponding independent assembly files into juicebox software for independent interaction adjustment, to obtain the adjusted independent assembly files.
Further, merging the adjusted independent assembly files to obtain a genome assembled at the chromosome level comprises: combining the adjusted independent assembly files into a total assembly file; numbering all contigs sequentially from 1, from largest to smallest, according to the total contig length in the adjusted independent assembly files; and importing the total assembly file into 3d-dna software to obtain the chromosome-level genome.
According to a second aspect of the present application, there is provided a device for genomic Hi-C mounting of greater than 10G, the device comprising: an acquisition module configured to acquire an original hic file and an original assembly file of genome Hi-C data greater than 10G; the splitting module is used for splitting the original assembly file into a plurality of independent assembly files and obtaining a plurality of corresponding independent hic files according to the independent assembly files; an adjustment module configured to obtain a plurality of adjusted independent assembly files from the plurality of independent assembly files and the plurality of independent hic files, respectively; and the merging module is used for merging the adjusted independent assembly files to obtain the genome assembled at the chromosome level.
Further, the acquisition module includes: a first conversion module configured to convert genome and Hi-C data at contig levels greater than 10G into a non-redundant txt file recognizable by 3d-dna software; and the second conversion module is arranged to input the 3d-dna recognizable redundancy-free txt file and the genome at the level of the contig into the 3d-dna software together to obtain an original hic file and an original assembly file.
Further, the splitting module comprises: an interaction adjusting unit, which is set to input the original hic file and the original assembly file into the juicebox software to manually adjust the interaction relation; a cluster dividing unit configured to divide the manually adjusted interaction graph into a plurality of clusters according to the size of the genome at the contig level and the number of chromosomes to obtain a plurality of adjusted assembly files; a first renumbering unit, which is set to renumber each contig from 1 again according to the sequence numbering order in each adjusted assembly file to obtain each independent assembly file; and the first conversion unit is used for inputting the plurality of independent assembly files into the juicebox _ tools software to respectively generate a plurality of corresponding independent hic files.
Further, the adjustment module is a juicebox module.
Further, the merging module includes: the file merging unit is used for merging the adjusted independent assembly files into a total assembly file; the second renumbering unit is set to sequentially number the contigs from 1 in descending order according to the adjusted total length of the contigs in the independently assembled file; a second conversion unit configured to input the total assembly file into the 3d-dna software, thereby obtaining a genome at a chromosome level.
According to a third aspect of the present application, there is provided a computer readable storage medium comprising a stored program, wherein the program when executed controls an apparatus on which the computer readable storage medium is located to perform any of the above methods for genome Hi-C mounting greater than 10G.
According to a fourth aspect of the present application, there is provided a processor for executing a program, wherein the program when executed performs any of the above methods for genome Hi-C mounting greater than 10G.
With this technical scheme, the large genome is split into a plurality of small genomes smaller than 10G, each of which is adjusted independently; the adjusted genomes are then merged to generate the chromosome-level genome of the large genome. The method effectively reduces the software delay of manually adjusting the hic file of a large genome, without sacrificing the accuracy of the hic interactions.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application. In the drawings:
FIG. 1 shows a schematic flow chart of the method for genome Hi-C mounting of more than 10G in a preferred embodiment of the present application.
Detailed Description
It should be noted that, in the present application, the embodiments and features of the embodiments may be combined with each other without conflict. The present application will be described in detail with reference to examples.
As mentioned in the background, the inventors found in their studies that, for a genome smaller than 10G, the hic file is small, the delay of each manual adjustment in juicebox is short, and the chromosome-level genome can be adjusted quickly; for a genome larger than 10G, the delay of each manual adjustment increases markedly, and the adjustment process often causes juicebox to crash. Reducing the interaction resolution shortens the adjustment delay, but sacrifices the accuracy of the adjustment.
To address this problem, the inventors researched and analyzed existing Hi-C mounting methods and found that a large genome can be split into a plurality of small genomes smaller than 10G, each of which is adjusted independently; the adjusted genomes are then merged to generate the chromosome-level genome of the large genome. This effectively reduces the software delay when the hic file of a large genome is manually adjusted, without sacrificing hic interaction accuracy. On this basis, the inventors propose the improved scheme of the present application.
Example 1
In this example, a method for genomic Hi-C mounting of greater than 10G is presented, the method comprising:
step S101, acquiring an original hic file and an original assembly file of the Hi-C data of the genome larger than 10G;
step S102, splitting an original assembly file into a plurality of independent assembly files, and obtaining a plurality of corresponding independent hic files according to the plurality of independent assembly files;
step S103, respectively obtaining a plurality of adjusted independent assembly files according to the plurality of independent assembly files and the plurality of independent hic files;
and step S104, merging the adjusted independent assembly files to obtain a genome assembled at the chromosome level.
In this example, a large genome larger than 10G is split into a plurality of small genomes smaller than 10G, the corresponding Hi-C data are adjusted individually, and the adjusted genomes are then merged to generate the chromosome-level genome of the large genome. This mounting method effectively reduces the software delay when the hic file of a large genome is manually adjusted, without sacrificing hic interaction accuracy.
In a preferred embodiment, step S101 includes: converting the contig-level genome larger than 10G and the Hi-C data into a non-redundant txt file recognizable by 3d-dna software; and inputting that non-redundant txt file, together with the contig-level genome, into the 3d-dna software to obtain the original hic file and the original assembly file.
The Hi-C data are converted into a non-redundant txt file recognizable by 3d-dna software, for example by running the juicer software. The 3d-dna software then converts the contig-level genome file (fasta format) obtained from the original assembly, together with that non-redundant txt file, into the original hic file and the original assembly file, so that subsequent operations such as splitting and individual adjustment can be carried out.
In a preferred embodiment, step S102 includes: inputting the original hic file and the original assembly file into juicebox software and manually adjusting the interaction relationships; dividing the manually adjusted interaction map into a plurality of clusters according to the size of the contig-level genome and the number of chromosomes, to obtain a plurality of adjusted assembly files; renumbering the contigs in each adjusted assembly file from 1, following the order of their sequence numbers, to obtain the independent assembly files; and inputting the independent assembly files into juicebox_tools software to generate the corresponding independent hic files.
The juicebox software is written in Java for visualizing hic data. After an hic file is imported, juicebox can break, flip or move parts of the hic heat map according to interaction strength. In this preferred embodiment, the interaction relationships are adjusted manually, and the adjusted hic interaction map is divided into a plurality of clusters to obtain a plurality of adjusted assembly files (each adjusted assembly file may, for example, be a cluster of one or more chromosomes). The contigs in each adjusted assembly file are renumbered from 1 so that each of the resulting small genomes can be adjusted and assembled independently.
The following example illustrates this. Suppose the large genome is 50G in size, the original hic file and original assembly file contain contigs with sequence numbers 1 to 100, and the contigs are divided by size into 5 clusters, giving 5 independent assembly files containing contigs 1-20, 21-40, 41-60, 61-80 and 81-100, respectively. Within each independent file, for example the one holding sequence numbers 21-40, the contigs are renumbered 1-20. The sequence numbers in all 5 independent files therefore run from 1 to 20.
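The renumbering in this example can be sketched in a few lines. The code below assumes the five clusters tile sequence numbers 1-100 in blocks of 20 (that is, the fourth block is 61-80); the function names `split_and_renumber` and `renumber_scaffold` are hypothetical and only illustrate the bookkeeping, not any tool's actual behaviour.

```python
def split_and_renumber(cluster_ranges):
    """Map each original sequence number to (cluster id, new number starting at 1)."""
    mapping = {}
    for cluster_id, (first, last) in enumerate(cluster_ranges, start=1):
        for new_number, original in enumerate(range(first, last + 1), start=1):
            mapping[original] = (cluster_id, new_number)
    return mapping


def renumber_scaffold(scaffold, mapping):
    """Rewrite one numeric line, keeping the sign that marks reverse complement."""
    return [(1 if number > 0 else -1) * mapping[abs(number)][1] for number in scaffold]


# Assumed cluster boundaries for the 100-contig example.
ranges = [(1, 20), (21, 40), (41, 60), (61, 80), (81, 100)]
mapping = split_and_renumber(ranges)

print(mapping[21])   # (2, 1): contig 21 becomes contig 1 of the second file
print(mapping[40])   # (2, 20)
print(renumber_scaffold([21, -23, 40], mapping))   # [1, -3, 20]
```

Keeping the original-to-new mapping around is useful later, because the merge step has to restore a single global numbering.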
In a preferred embodiment, the step S103 includes: and importing the independent hic files and the corresponding independent assembly files into the juicebox software for independent interaction adjustment to obtain a plurality of adjusted independent assembly files.
The specific operation of performing individual interactive adjustment on the divided independent assembly files is the same as the hic mounting method of genome of less than 10G in the prior art. Namely, each small genome is subjected to interaction adjustment to obtain each adjusted independent assembly file.
In a preferred embodiment, step S104 includes: combining the adjusted independent assembly files into a total assembly file; numbering all contigs sequentially from 1, from largest to smallest, according to the total contig length in the adjusted independent assembly files; and importing the total assembly file into 3d-dna software to obtain the chromosome-level genome.
In this preferred embodiment, the adjusted independent assembly files are combined and then numbered from 1 in descending order of total contig length after merging, so that the chromosome-level genomes are output with distinct numbers (that is, they are numbered and output in descending order of assembled chromosome length). Numbering by total contig length is the conventional output order but not the only one; other orders may be used, such as random order or ascending order of total contig length.
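One possible reading of the merge step is sketched below: the per-file sequence numbers are shifted so that they become globally unique again, and the merged scaffolds are then ordered by total contig length, longest first. This is an interpretation under stated assumptions, not the patented implementation; it assumes each independent file numbers its contigs contiguously from 1, and the name `merge_assemblies` and the sample data are hypothetical.

```python
def merge_assemblies(parts):
    """Merge (sequences, scaffolds) pairs into one assembly.

    sequences maps a local sequence number to (name, length);
    scaffolds is a list of signed-number lists (negative = reverse complement).
    """
    merged_sequences = {}
    merged_scaffolds = []
    offset = 0
    for sequences, scaffolds in parts:
        for local_number in sorted(sequences):
            merged_sequences[local_number + offset] = sequences[local_number]
        for scaffold in scaffolds:
            # Shift the magnitude by the offset while preserving the sign.
            merged_scaffolds.append(
                [n + offset if n > 0 else n - offset for n in scaffold]
            )
        offset += len(sequences)

    def total_length(scaffold):
        return sum(merged_sequences[abs(n)][1] for n in scaffold)

    # Longest scaffold (assembled chromosome) first.
    merged_scaffolds.sort(key=total_length, reverse=True)
    return merged_sequences, merged_scaffolds


part_a = ({1: ("a1", 100), 2: ("a2", 50)}, [[1, -2]])
part_b = ({1: ("b1", 400), 2: ("b2", 30)}, [[-1], [2]])
sequences, scaffolds = merge_assemblies([part_a, part_b])
print(scaffolds)   # [[-3], [1, -2], [4]]
```

As the text notes, descending length is only the conventional order; changing the sort key or dropping the sort gives the other output orders mentioned.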
In another preferred embodiment, 3d-dna software may also be used to generate corresponding hic files (for generating hic heatmaps) and hic heatmaps (for presentation, visual display) for the merged total assembly.
The benefits of the present application will be further illustrated below in conjunction with other embodiments.
Example 2
As shown in FIG. 1, the specific method for adjusting the hic mounting of a large genome (larger than 10G) in this example is as follows:
1. Input the genome assembled from second-generation or third-generation sequencing, together with the Hi-C data, into juicer software to obtain a non-redundant txt file.
2. Input the genome assembled from second-generation or third-generation sequencing and the non-redundant txt file into 3d-dna software to obtain the original hic file and assembly file.
3. Input the original hic file and assembly file into juicebox software, adjust manually, and preliminarily divide the interaction map into 5-10 clusters according to genome size to obtain the adjusted assembly file.
4. From the adjusted assembly file, generate 5-10 independent assembly files, one per cluster, following the order of the sequence numbers; renumber the entries within each independent assembly file so that the sequence numbers start from 1.
5. Use juicebox_tools software to generate the corresponding hic file for each of the 5-10 independent assembly files, then import each hic file with its assembly file into juicebox software for independent adjustment, producing the adjusted independent assembly files.
6. Combine the adjusted independent assembly files into a total assembly file, with the sequence numbers renumbered so that they increase from 1.
7. Use 3d-dna software on the total assembly file to generate the corresponding hic file, the hic heatmap, and the final chromosome-level genome file.
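Step 4 above, which turns one adjusted assembly file into several independent ones, can be sketched at the file level as follows. This is a minimal sketch that assumes the assembly text follows the two-part layout described in the background and that each cluster is given as a list of 0-based positions of numeric lines; `split_assembly_text` and the sample contig names are hypothetical, and real files may carry additional conventions not handled here.

```python
def split_assembly_text(text, clusters):
    """Return one assembly text per cluster, with contigs renumbered from 1."""
    records = {}   # original sequence number -> (name, length as text)
    rows = []      # numeric lines as lists of signed integers
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name, number, length = line[1:].split()
            records[int(number)] = (name, length)
        elif line:
            rows.append([int(token) for token in line.split()])

    outputs = []
    for row_ids in clusters:
        # Renumber following the original sequence-number order.
        used = sorted({abs(n) for r in row_ids for n in rows[r]})
        new_number = {old: new for new, old in enumerate(used, start=1)}
        header = [f">{records[old][0]} {new_number[old]} {records[old][1]}" for old in used]
        body = [
            " ".join(str(new_number[abs(n)] if n > 0 else -new_number[abs(n)]) for n in rows[r])
            for r in row_ids
        ]
        outputs.append("\n".join(header + body) + "\n")
    return outputs


text = ">c1 1 500\n>c2 2 300\n>c3 3 200\n>c4 4 100\n1 -2\n3\n-4\n"
# First numeric line forms cluster 1; the second and third form cluster 2.
parts = split_assembly_text(text, [[0], [1, 2]])
for part in parts:
    print(part)
```

Each returned text could then be written to its own file before the corresponding hic file is generated in step 5.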
Example 3
1. Input the contig-level genome file of a genome 50G in size, together with the hic data, into juicer software to obtain a non-redundant txt file.
2. Input the genome assembled from second-generation or third-generation sequencing and the non-redundant txt file into 3d-dna software to obtain the original hic file and assembly file.
3. Input the original hic file and assembly file into juicebox software, adjust manually, and preliminarily divide the interaction map into 10 clusters according to genome size and number of chromosomes to obtain the adjusted assembly file.
4. From the adjusted assembly file, generate 10 independent assembly files, one per cluster, following the order of the sequence numbers; renumber the entries within each independent assembly file so that the sequence numbers start from 1.
5. Use juicebox_tools software to generate the corresponding hic file for each of the 10 independent assembly files, then import each of the 10 hic files with its assembly file into juicebox software for independent adjustment, producing the adjusted independent assembly files.
6. Combine the adjusted independent assembly files into a total assembly file, with the sequence numbers renumbered so that they increase from 1.
7. Use 3d-dna software on the total assembly file to generate the corresponding hic file, the hic heatmap, and the final chromosome-level 50G genome file.
Table 1: Comparison of 50G genome hic adjustment results
[Table 1 is an image in the original publication and is not reproduced here.]
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a computer-readable storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Corresponding to the above, the present application also provides a device for Hi-C mounting of genomes larger than 10G, which implements the above method; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the device described in the embodiments below is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
This is further illustrated below in connection with alternative embodiments.
Example 4
This example provides an apparatus for genome Hi-C mounting of greater than 10G, the apparatus comprising: an acquisition module configured to acquire an original hic file and an original assembly file of genome Hi-C data greater than 10G; the splitting module is used for splitting the original assembly file into a plurality of independent assembly files and obtaining a plurality of corresponding independent hic files according to the independent assembly files; an adjustment module configured to obtain a plurality of adjusted independent assembly files from the plurality of independent assembly files and the plurality of independent hic files, respectively; and the merging module is used for merging the adjusted independent assembly files to obtain the genome assembled at the chromosome level.
Optionally, the acquisition module includes: a first conversion module configured to convert the contig-level genome larger than 10G and the Hi-C data into a non-redundant txt file recognizable by 3d-dna software; and a second conversion module configured to input that non-redundant txt file, together with the contig-level genome, into the 3d-dna software to obtain the original hic file and the original assembly file.
Optionally, the splitting module comprises: an interaction adjusting unit configured to input the original hic file and the original assembly file into juicebox software and manually adjust the interaction relationships; a cluster dividing unit configured to divide the manually adjusted interaction map into a plurality of clusters according to the size of the contig-level genome and the number of chromosomes, to obtain a plurality of adjusted assembly files; a first renumbering unit configured to renumber the contigs in each adjusted assembly file from 1, following the order of their sequence numbers, to obtain the independent assembly files; and a first conversion unit configured to input the independent assembly files into juicebox_tools software to generate the corresponding independent hic files.
Optionally, the adjustment module is a juicebox module.
Optionally, the merging module includes: the file merging unit is used for merging the adjusted independent assembly files into a total assembly file; the second renumbering unit is set to sequentially number the contigs from 1 in descending order according to the adjusted total length of the contigs in the independently assembled file; a second conversion unit configured to input the total assembly file into the 3d-dna software, thereby obtaining a genome at a chromosome level.
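The four-module structure of the device can be pictured as a simple pipeline. The sketch below is purely illustrative: the class name `MountingDevice`, its field names and the toy stand-in functions are all hypothetical, and the real adjustment module involves manual work in juicebox rather than a function call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MountingDevice:
    """Chains the acquisition, splitting, adjustment and merging modules."""
    acquire: Callable[[], str]          # returns the original assembly
    split: Callable[[str], List[str]]   # returns the independent assemblies
    adjust: Callable[[str], str]        # adjusts one independent assembly
    merge: Callable[[List[str]], str]   # returns the total assembly

    def run(self) -> str:
        original = self.acquire()
        parts = self.split(original)
        adjusted = [self.adjust(part) for part in parts]
        return self.merge(adjusted)


# Toy stand-ins so the sketch runs; real modules would wrap the tools named above.
device = MountingDevice(
    acquire=lambda: "a,b,c",
    split=lambda assembly: assembly.split(","),
    adjust=lambda part: part.upper(),
    merge=lambda parts: "+".join(parts),
)
result = device.run()
print(result)   # A+B+C
```

Because each part is adjusted independently, the adjustment stage is the natural place to hand the work to separate juicebox sessions.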
Example 5
The embodiment provides a computer-readable storage medium, which includes a stored program, and when the program runs, the apparatus on which the computer-readable storage medium is located is controlled to perform any one of the above methods for mounting the genome Hi-C larger than 10G.
The embodiment also provides a processor, which is used for running the program, wherein the program runs to execute any one of the above methods for mounting the genome Hi-C with the size larger than 10G.
From the above description, it can be seen that, for a genome larger than 10G, splitting the assembly file into 5-10 independent assembly files significantly reduces the lag during adjustment in the juicebox software. This avoids crashes of the juicebox software without lowering the Hi-C interaction resolution or sacrificing the accuracy of the adjustment.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (12)

1. A method for genomic Hi-C mounting of greater than 10G, the method comprising:
acquiring an original hic file and an original assembly file of the Hi-C genome data larger than 10G;
splitting the original assembly file into a plurality of independent assembly files, and obtaining a plurality of corresponding independent hic files according to the independent assembly files;
obtaining a plurality of adjusted independent assembly files according to the plurality of independent assembly files and the plurality of independent hic files;
and merging a plurality of adjusted independent assembly files to obtain a genome assembled at the chromosome level.
2. The method of claim 1, wherein obtaining the original hic file and the original assembly file for genomic Hi-C data greater than 10G comprises:
converting genome and Hi-C data at contig levels greater than 10G into non-redundant txt files recognizable by 3d-dna software;
and inputting the non-redundant txt file which can be identified by the 3d-dna and the genome at the level of the contig into the 3d-dna software together to obtain the original hic file and an original assembly file.
3. The method of claim 2, wherein splitting the original assembled file into a plurality of independent assembled files, and obtaining a corresponding plurality of independent hic files from the plurality of independent assembled files comprises:
inputting the original hic file and the original assembly file into a juicebox software to manually adjust the interaction relationship;
dividing the manually adjusted interaction graph into a plurality of clusters according to the size and the number of chromosomes of the genome at the contig level to obtain a plurality of adjusted assembly files;
numbering each contig from 1 again according to the sequence numbering sequence in each adjusted assembly file to obtain each independent assembly file;
inputting the independent assembly files into juicebox_tools software to respectively generate a plurality of corresponding independent hic files.
4. The method according to claim 1, wherein obtaining a plurality of adjusted independent assembly files from a plurality of the independent assembly files and a plurality of the independent hic files, respectively, comprises:
and importing the independent hic files and the corresponding independent assembly files into juicebox software for independent interaction adjustment to obtain a plurality of adjusted independent assembly files.
5. The method of claim 1, wherein merging the plurality of adjusted independent assembly files to obtain a genome assembled at a chromosome level comprises:
combining a plurality of adjusted independent assembly files into a total assembly file;
numbering each contig in sequence from 1 according to the total length of the contigs in the adjusted independent assembly file from large to small;
inputting the total assembly file into 3d-dna software, thereby obtaining the genome at the chromosome level.
6. An apparatus for genomic Hi-C mounting of greater than 10G, comprising:
an acquisition module configured to acquire an original hic file and an original assembly file of genome Hi-C data greater than 10G;
the splitting module is used for splitting the original assembly file into a plurality of independent assembly files and obtaining a plurality of corresponding independent hic files according to the independent assembly files;
an adjustment module configured to obtain a plurality of adjusted independent assembly files from the plurality of independent assembly files and the plurality of independent hic files, respectively;
a merging module configured to merge the plurality of adjusted independent assembly files to obtain a genome assembled at a chromosome level.
7. The apparatus of claim 6, wherein the obtaining module comprises:
a first conversion module configured to convert genome and Hi-C data at contig levels greater than 10G into a non-redundant txt file recognizable by 3d-dna software;
and the second conversion module is used for inputting the non-redundant txt file which can be identified by the 3d-dna and the genome at the level of the contig into the 3d-dna software together to obtain the original hic file and an original assembly file.
8. The apparatus of claim 7, wherein the splitting module comprises:
an interaction adjusting unit, which is set to input the original hic file and the original assembly file into a juicebox software to manually adjust the interaction relation;
a cluster dividing unit configured to divide the manually adjusted interaction graph into a plurality of clusters according to the size of the genome at the contig level and the number of chromosomes to obtain a plurality of adjusted assembly files;
a first renumbering unit configured to renumber each of the contigs from 1 again in a sequence numbering order within each of the adjusted assembly files to obtain each of the independent assembly files;
and the first conversion unit is used for inputting the independent assembly files into the juicebox_tools software to respectively generate a plurality of corresponding independent hic files.
9. The apparatus of claim 6, wherein the adjustment module is a juicebox module.
10. The apparatus of claim 6, wherein the merging module comprises:
the file merging unit is used for merging a plurality of adjusted independent assembly files into a total assembly file;
the second renumbering unit is set to sequentially number the contigs from 1 in descending order according to the total length of the contigs in the adjusted independent assembly file;
a second conversion unit configured to input the total assembly file into 3d-dna software, thereby obtaining the genome at the chromosome level.
11. A computer-readable storage medium, comprising a stored program, wherein the program, when executed, controls an apparatus on which the computer-readable storage medium is located to perform the method for genomic Hi-C mount greater than 10G according to any one of claims 1 to 5.
12. A processor, characterized in that the processor is configured to run a program, wherein the program when running performs the method of more than 10G genomic Hi-C mount of any of claims 1 to 5.
CN202210463242.9A 2022-04-29 2022-04-29 Method and device for carrying Hi-C genome larger than 10G Active CN114566212B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210463242.9A CN114566212B (en) 2022-04-29 2022-04-29 Method and device for carrying Hi-C genome larger than 10G

Publications (2)

Publication Number Publication Date
CN114566212A true CN114566212A (en) 2022-05-31
CN114566212B CN114566212B (en) 2022-09-16

Family

ID=81720943

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210463242.9A Active CN114566212B (en) 2022-04-29 2022-04-29 Method and device for carrying Hi-C genome larger than 10G

Country Status (1)

Country Link
CN (1) CN114566212B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115579061A (en) * 2022-12-07 2023-01-06 北京诺禾致源科技股份有限公司 Method and device for analyzing genome hic

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106471509A (en) * 2014-06-24 2017-03-01 巴斯德研究所 It is derived from method, equipment and the computer program of the chromosome of one or more organisms for assembling
CN109326323A (en) * 2018-09-13 2019-02-12 北京百迈客生物科技有限公司 A kind of assemble method and device of genome
US20190279740A1 (en) * 2018-01-14 2019-09-12 The Broad Institute, Inc. Linear genome assembly from three dimensional genome structure
CN112289382A (en) * 2020-10-28 2021-01-29 天津诺禾致源生物信息科技有限公司 Method and device for splitting polyploid genome homologous chromosome and application thereof
CN112820354A (en) * 2021-02-25 2021-05-18 深圳华大基因科技服务有限公司 Method and device for assembling diploid and storage medium
WO2021119550A1 (en) * 2019-12-13 2021-06-17 The Broad Institute, Inc. Method for determination of 3d genome architecture with base pair resolution and further uses thereof
CN113782101A (en) * 2021-11-12 2021-12-10 北京诺禾致源科技股份有限公司 Method and device for removing redundancy of high heterozygous diploid sequence assembly result and application of method and device
CN113808668A (en) * 2021-11-18 2021-12-17 北京诺禾致源科技股份有限公司 Method and device for improving genome assembly integrity and application thereof
CN113918355A (en) * 2021-11-25 2022-01-11 天津诺禾致源生物信息科技有限公司 Genome assembly method and device, computer readable storage medium and processor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
THE CENTER FOR GENOME ARCHITECTURE, BAYLOR COLLEGE OF MEDICINE &: "Genome Assembly Cookbook", 《HTTP://AIDENLAB.ORG/ASSEMBLY/MANUAL_180322.PDF》 *

Also Published As

Publication number Publication date
CN114566212B (en) 2022-09-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant