CN114566212B - Method and device for Hi-C mounting of genomes larger than 10G

Publication number
CN114566212B
Authority
CN
China
Prior art keywords
files
file
assembly
independent
hic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210463242.9A
Other languages
Chinese (zh)
Other versions
CN114566212A (en)
Inventor
赵勇
周勋
王龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Novogene Biological Information Technology Co ltd
Original Assignee
Tianjin Novogene Biological Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Novogene Biological Information Technology Co ltd
Priority to CN202210463242.9A
Publication of CN114566212A
Application granted
Publication of CN114566212B
Legal status: Active

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application provides a method and a device for Hi-C mounting of genomes larger than 10G. The method comprises the following steps: acquiring an original hic file and an original assembly file from the Hi-C data of a genome larger than 10G; splitting the original assembly file into a plurality of independent assembly files, and obtaining a plurality of corresponding independent hic files from the independent assembly files; obtaining a plurality of adjusted independent assembly files from the plurality of independent assembly files and the plurality of independent hic files; and merging the adjusted independent assembly files to obtain a genome assembled at the chromosome level. The method effectively reduces the software delay when the hic file of a large genome is manually adjusted, without sacrificing the accuracy of the hic interactions.

Description

Method and device for Hi-C mounting of genomes larger than 10G
Technical Field
The application relates to the field of genome assembly, in particular to a method and a device for Hi-C mounting of genomes larger than 10G.
Background
Genome assembly is generally divided into second-generation sequencing data assembly and third-generation sequencing data assembly. The assembly software commonly used for second-generation sequencing data is SOAPdenovo, which combines small-fragment and large-fragment data to produce a scaffold-level genome; the assembly software commonly used for third-generation sequencing data is Canu or FALCON, which produces a contig-level genome. Neither of these two assembly approaches can assemble the genome to the chromosome level.
Hi-C (high-throughput chromosome conformation capture) exploits the principle that the interaction strength within a chromosome is far greater than the interaction strength between chromosomes. Tissue is cross-linked and fixed with formaldehyde, the genome is digested with a specific restriction enzyme, the ends are biotin-labeled and repaired, the fragments are ligated and then sheared, and the biotin-labeled fragments are captured with magnetic beads for high-throughput sequencing. The sequencing data are combined with the contig-level or scaffold-level genome and mounted with the 3d-dna software, which generates a hic file and an assembly file; after manual adjustment in juicebox, the chromosome-level genome is finally obtained.
The hic file is a binary file generated by the 3d-dna software; its content is the hic interaction matrix information. The default resolutions of the hic file generated by 3d-dna are "2500000, 1000000, 500000, 250000, 100000, 50000, 25000, 10000, 5000, 1000".
The assembly file is a text file corresponding to the hic file; it records the length and position information of each contig or scaffold. The assembly file is divided into two parts. The first part consists of lines beginning with ">", each with 3 columns: the first column is the name of the sequence, prefixed with ">"; the second column is the sequence number, which starts from 1; the third column is the length of the sequence. The second part consists of purely numeric lines with a variable number of columns: each line represents several contigs or scaffolds clustered into one large scaffold, the numbers in the line correspond to the sequence numbers of the first part, and a negative sign indicates that the sequence is reverse-complemented.
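By way of illustration only, the two-part layout described above can be read with a few lines of Python. This is a minimal sketch, not part of the 3d-dna software: the names `AssemblyRecord` and `parse_assembly` and the sample contig names are hypothetical, and the parser assumes whitespace-separated columns.

```python
from dataclasses import dataclass


@dataclass
class AssemblyRecord:
    name: str    # sequence name, without the leading ">"
    index: int   # sequence number, starting from 1
    length: int  # sequence length


def parse_assembly(text):
    """Split assembly text into (records, scaffolds).

    records   -- one AssemblyRecord per line beginning with ">"
    scaffolds -- one list of signed sequence numbers per numeric line;
                 a negative number marks a reverse-complemented sequence
    """
    records, scaffolds = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The last two columns are the sequence number and the length.
            name, index, length = line[1:].rsplit(None, 2)
            records.append(AssemblyRecord(name, int(index), int(length)))
        else:
            scaffolds.append([int(token) for token in line.split()])
    return records, scaffolds


# Hypothetical three-contig assembly: contig 3 is reversed in scaffold 1.
example = """\
>ctg_a 1 5000
>ctg_b 2 3000
>ctg_c 3 1200
1 -3
2
"""
records, scaffolds = parse_assembly(example)
```

Keeping the sign on each number preserves the orientation information, so the second part can be rewritten later without consulting the sequences themselves.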
Juicebox is software written in the Java language for visualizing hic data. After a hic file is imported into juicebox, parts of the hic heat map can be broken, flipped or moved according to the interaction strength; the adjusted hic file yields a corresponding adjusted assembly file, from which the mounted chromosome-level genome is obtained.
However, the inventors found in their studies that for genomes smaller than 10G the hic file is small, the delay of manual adjustment in juicebox is small, and the chromosome-level genome can be adjusted quickly; for genomes larger than 10G, the delay of each manual adjustment increases markedly, and the adjustment process often crashes the juicebox software. Reducing the interaction resolution reduces the adjustment delay, but sacrifices the accuracy of the adjustment.
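As a rough illustration of the scale involved, the following sketch counts the bins of a contact map at each default 3d-dna resolution quoted in the background. It reads "G" as 10^9 base pairs and treats the map as a dense square matrix; both are simplifying assumptions, since hic files store sparse data and the delay observed in juicebox need not scale in exactly this way. The function names and the 50G example value are illustrative only.

```python
# Default 3d-dna resolutions quoted in the background (assumed to be in bp).
RESOLUTIONS = [2500000, 1000000, 500000, 250000, 100000,
               50000, 25000, 10000, 5000, 1000]


def bins_per_axis(genome_bp, resolution_bp):
    """Number of bins along one axis of the contact map (ceiling division)."""
    return -(-genome_bp // resolution_bp)


def matrix_cells(genome_bp, resolution_bp):
    """Upper bound on the number of cells of a dense square contact matrix."""
    n = bins_per_axis(genome_bp, resolution_bp)
    return n * n


whole = 50 * 10**9   # illustrative 50G genome
part = whole // 10   # one of 10 equally sized clusters (an idealisation)

for res in RESOLUTIONS:
    ratio = matrix_cells(whole, res) // matrix_cells(part, res)
    print(f"{res:>8}: {bins_per_axis(whole, res):>9} bins (whole), "
          f"{bins_per_axis(part, res):>8} bins (cluster), "
          f"{ratio}x fewer cells per cluster")
```

Under these assumptions each cluster holds about one hundredth of the cells of the whole map at the same resolution, which is consistent with the observation that a smaller map can be adjusted without lowering the resolution.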
Therefore, there is still a need to improve the method of mounting hic data in the assembly of genomes larger than 10G, so as to reduce the software delay when the hic file of a large genome is manually adjusted.
Disclosure of Invention
The main purpose of the present application is to provide a method and a device for Hi-C mounting of genomes larger than 10G, so as to reduce the software delay when the hic file of a large genome is manually adjusted.
To achieve the above object, according to one aspect of the present application, there is provided a method for Hi-C mounting of a genome larger than 10G, the method comprising: acquiring an original hic file and an original assembly file from the Hi-C data of a genome larger than 10G; splitting the original assembly file into a plurality of independent assembly files, and obtaining a plurality of corresponding independent hic files from the independent assembly files; obtaining a plurality of adjusted independent assembly files from the plurality of independent assembly files and the plurality of independent hic files; and merging the adjusted independent assembly files to obtain the genome assembled at the chromosome level.
Further, acquiring the original hic file and the original assembly file from the Hi-C data of a genome larger than 10G includes: converting the contig-level genome larger than 10G and the Hi-C data into a non-redundant txt file recognizable by the 3d-dna software; and inputting the non-redundant txt file recognizable by 3d-dna, together with the contig-level genome, into the 3d-dna software to obtain the original hic file and the original assembly file.
Further, splitting the original assembly file into a plurality of independent assembly files, and obtaining a plurality of corresponding independent hic files from the plurality of independent assembly files includes: inputting the original hic file and the original assembly file into the juicebox software to manually adjust the interaction relationship; dividing the manually adjusted interaction map into a plurality of clusters according to the size of the contig-level genome and the number of chromosomes to obtain a plurality of adjusted assembly files; renumbering the contigs in each adjusted assembly file from 1 in sequence-number order to obtain each independent assembly file; and inputting the independent assembly files into the juicebox_tools software to generate the corresponding independent hic files.
Further, obtaining a plurality of adjusted independent assembly files from the plurality of independent assembly files and the plurality of independent hic files includes: importing each independent hic file and its corresponding independent assembly file into the juicebox software for independent interaction adjustment to obtain a plurality of adjusted independent assembly files.
Further, merging the plurality of adjusted independent assembly files to obtain a genome assembled at the chromosome level comprises: combining the plurality of adjusted independent assembly files into a total assembly file; numbering all the contigs from 1 in descending order of the total contig length in the adjusted independent assembly files; and importing the total assembly file into the 3d-dna software to obtain the chromosome-level genome.
According to a second aspect of the present application, there is provided a device for Hi-C mounting of a genome larger than 10G, the device comprising: an acquisition module configured to acquire an original hic file and an original assembly file from the Hi-C data of a genome larger than 10G; a splitting module configured to split the original assembly file into a plurality of independent assembly files and to obtain a plurality of corresponding independent hic files from the independent assembly files; an adjustment module configured to obtain a plurality of adjusted independent assembly files from the plurality of independent assembly files and the plurality of independent hic files; and a merging module configured to merge the adjusted independent assembly files to obtain the genome assembled at the chromosome level.
Further, the acquisition module includes: a first conversion module configured to convert the contig-level genome larger than 10G and the Hi-C data into a non-redundant txt file recognizable by the 3d-dna software; and a second conversion module configured to input the non-redundant txt file recognizable by 3d-dna, together with the contig-level genome, into the 3d-dna software to obtain the original hic file and the original assembly file.
Further, the splitting module comprises: an interaction adjusting unit configured to input the original hic file and the original assembly file into the juicebox software to manually adjust the interaction relationship; a cluster dividing unit configured to divide the manually adjusted interaction map into a plurality of clusters according to the size of the contig-level genome and the number of chromosomes to obtain a plurality of adjusted assembly files; a first renumbering unit configured to renumber the contigs in each adjusted assembly file from 1 in sequence-number order to obtain each independent assembly file; and a first conversion unit configured to input the plurality of independent assembly files into the juicebox_tools software to generate the corresponding independent hic files.
Further, the adjusting module is a juicebox module.
Further, the merging module includes: the file merging unit is used for merging the adjusted independent assembly files into a total assembly file; the second renumbering unit is set to sequentially number the contigs from 1 in descending order according to the adjusted total length of the contigs in the independently assembled file; a second conversion unit configured to input the total assembly file into the 3d-dna software, thereby obtaining a genome at a chromosome level.
According to a third aspect of the present application, there is provided a computer readable storage medium comprising a stored program, wherein the program when executed controls an apparatus on which the computer readable storage medium is located to perform any of the above methods for genome Hi-C mounting greater than 10G.
According to a fourth aspect of the present application, there is provided a processor for executing a program, wherein the program when executed performs any of the above methods for genome Hi-C mounting greater than 10G.
By applying the technical scheme of the present application, the large genome is divided into a plurality of small genomes smaller than 10G, each small genome is adjusted independently, and the adjusted genomes are then merged again to finally generate the chromosome-level genome of the large genome. The method effectively reduces the software delay when the hic file of a large genome is manually adjusted, without sacrificing the accuracy of the hic interactions.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application. In the drawings:
FIG. 1 shows a schematic flow chart of the method for Hi-C mounting of a genome larger than 10G in a preferred embodiment of the present application.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail with reference to examples.
As mentioned in the background, the inventors found in their studies that for genomes smaller than 10G the hic file is small, the delay of manual adjustment in juicebox is small, and the chromosome-level genome can be adjusted quickly; for genomes larger than 10G, the delay of each manual adjustment increases markedly, and the adjustment process often crashes the juicebox software. Reducing the interaction resolution reduces the adjustment delay, but sacrifices the accuracy of the adjustment.
To address this problem, the inventors researched and analyzed the existing Hi-C mounting method and found that splitting the large genome into a plurality of small genomes smaller than 10G, adjusting each one independently, and then merging the adjusted genomes again finally generates the chromosome-level genome of the large genome. This effectively reduces the software delay when the hic file of a large genome is manually adjusted, without sacrificing the accuracy of the hic interactions. On this basis, the inventors propose the improved scheme of the present application.
Example 1
In this example, a method for Hi-C mounting of a genome larger than 10G is presented, the method comprising:
Step S101, acquiring an original hic file and an original assembly file from the Hi-C data of a genome larger than 10G;
Step S102, splitting the original assembly file into a plurality of independent assembly files, and obtaining a plurality of corresponding independent hic files from the plurality of independent assembly files;
Step S103, obtaining a plurality of adjusted independent assembly files from the plurality of independent assembly files and the plurality of independent hic files;
Step S104, merging the adjusted independent assembly files to obtain a genome assembled at the chromosome level.
In this example, a large genome larger than 10G is split into a plurality of small genomes smaller than 10G, the corresponding Hi-C data are adjusted individually, and the adjusted genomes are then merged again to finally generate the chromosome-level genome of the large genome. This mounting method effectively reduces the software delay when the hic file of a large genome is manually adjusted, without sacrificing the accuracy of the hic interactions.
In a preferred embodiment, the step S101 includes: converting the contig-level genome larger than 10G and the Hi-C data into a non-redundant txt file recognizable by the 3d-dna software; and inputting the non-redundant txt file recognizable by 3d-dna, together with the contig-level genome, into the 3d-dna software to obtain the original hic file and the original assembly file.
The Hi-C data are converted into a non-redundant txt file recognizable by the 3d-dna software, for example by the juicer software. The 3d-dna software then converts the contig-level genome file (fasta format) obtained from the original assembly, together with the non-redundant txt file, into the original hic file and the original assembly file, ready for the subsequent splitting and individual adjustment.
In a preferred embodiment, the step S102 includes: inputting the original hic file and the original assembly file into the juicebox software to manually adjust the interaction relationship; dividing the manually adjusted interaction map into a plurality of clusters according to the size of the contig-level genome and the number of chromosomes to obtain a plurality of adjusted assembly files; renumbering the contigs in each adjusted assembly file from 1 in sequence-number order to obtain each independent assembly file; and inputting the independent assembly files into the juicebox_tools software to generate the corresponding independent hic files.
The juicebox software is software written in the Java language for visualizing hic data. After a hic file is imported into juicebox, parts of the hic heat map can be broken, flipped or moved according to the interaction strength. In the above preferred embodiment, the interaction relationship is manually adjusted, and the adjusted hic interaction map is then divided into a plurality of clusters according to the distance of the interaction relationship, giving a plurality of adjusted assembly files (for example, the contigs of one or more chromosomes may be assigned to one cluster). The sequence numbers of the contigs in each adjusted assembly file are renumbered from 1 so that each of the divided small genomes can be adjusted and assembled independently.
The following example illustrates this. Suppose the size of the large genome is 50G, the contigs in the original hic file and the original assembly file have sequence numbers 1 to 100, and the contigs are divided into 5 clusters according to size, giving 5 independent assembly files. The contigs with sequence numbers 1-20, 21-40, 41-60, 61-80 and 81-100 are then assigned to the 5 independent assembly files respectively. Within each independent file, for example the one holding sequence numbers 21-40, the sequence numbers are renumbered as 1-20, so that each of the 5 independent files is numbered from 1 to 20.
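The renumbering in this worked example can be sketched as follows. The helper names `renumber_cluster` and `localize_scaffolds` are hypothetical, and the sketch assumes that each cluster is given as a collection of the original sequence numbers; it keeps the reverse mapping so that the original numbers can be recovered when the files are merged.

```python
def renumber_cluster(global_ids):
    """Map the original sequence numbers of one cluster to local numbers 1..n.

    Returns (local_of, global_of): original -> local and local -> original.
    """
    ordered = sorted(global_ids)
    local_of = {g: i for i, g in enumerate(ordered, start=1)}
    global_of = {i: g for g, i in local_of.items()}
    return local_of, global_of


def localize_scaffolds(scaffolds, local_of):
    """Rewrite signed original numbers as signed local numbers.

    The sign (orientation) of every entry is preserved.
    """
    return [[(1 if s > 0 else -1) * local_of[abs(s)] for s in row]
            for row in scaffolds]


# Five clusters of 20 consecutive sequence numbers each, covering 1 to 100.
clusters = [range(start, start + 20) for start in range(1, 101, 20)]

# Second cluster: original numbers 21-40 become local numbers 1-20.
local_of, global_of = renumber_cluster(clusters[1])
rows = localize_scaffolds([[21, -23], [40]], local_of)
```

After this step every independent file is numbered from 1, which is what allows it to be treated as a small genome in its own right.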
In a preferred embodiment, the step S103 includes: importing each independent hic file and its corresponding independent assembly file into the juicebox software for independent interaction adjustment to obtain a plurality of adjusted independent assembly files.
The individual interaction adjustment of each divided independent assembly file is performed in the same way as the prior-art hic mounting method for genomes smaller than 10G. That is, each small genome undergoes interaction adjustment to yield its adjusted independent assembly file.
In a preferred embodiment, the step S104 includes: combining the plurality of adjusted independent assembly files into a total assembly file; numbering all the contigs from 1 in descending order of the total contig length in the adjusted independent assembly files; and importing the total assembly file into the 3d-dna software to obtain the chromosome-level genome.
In the preferred embodiment, the adjusted independent assembly files are combined and then numbered from 1 in descending order of the total contig length, so that the chromosome-level genomes are output with distinct numbers (i.e., numbered and output in descending order of assembled chromosome length). Numbering by descending total contig length is the usual output convention but not the only possible one; other orders, such as random order or ascending total contig length, may also be used.
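One possible reading of this merging step is sketched below. It is an illustration under stated assumptions rather than the implementation of the application: the function name `merge_assemblies` is hypothetical, each part is assumed to be numbered 1..n without gaps, and the descending order is applied to the scaffold lines by the summed length of their contigs.

```python
def merge_assemblies(parts):
    """Merge per-cluster assemblies into one total assembly.

    parts -- list of (records, scaffolds); records are (name, local_index,
             length) tuples numbered 1..n within each part, and scaffolds
             are lists of signed local indices.
    Returns (records, scaffolds) with unique numbers starting from 1 and
    scaffold rows ordered by total length, longest first.
    """
    merged_records, merged_scaffolds = [], []
    offset = 0
    for records, scaffolds in parts:
        for name, local_index, length in records:
            merged_records.append((name, local_index + offset, length))
        for row in scaffolds:
            # Shift the magnitude by the offset and keep the orientation sign.
            merged_scaffolds.append(
                [s + offset if s > 0 else s - offset for s in row])
        offset += len(records)

    length_of = {index: length for _, index, length in merged_records}
    merged_scaffolds.sort(
        key=lambda row: sum(length_of[abs(s)] for s in row), reverse=True)
    return merged_records, merged_scaffolds


# Two hypothetical adjusted parts, each numbered from 1.
part_a = ([("a1", 1, 400), ("a2", 2, 100)], [[1, -2]])
part_b = ([("b1", 1, 900), ("b2", 2, 50)], [[-2], [1]])
records, scaffolds = merge_assemblies([part_a, part_b])
```

Because the sort is isolated in one statement, a different output order (ascending or random, as mentioned above) only requires changing the sort key.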
In another preferred embodiment, 3d-dna software may also be used to generate corresponding hic files (for generating hic heatmaps) and hic heatmaps (for presentation, visual display) for the merged total assembly.
The benefits of the present application will be further illustrated below in conjunction with other embodiments.
Example 2
As shown in FIG. 1, the specific method for adjusting the mounting of large genome (greater than 10G) hic in this example is as follows:
1. Input the genome assembled from second-generation or third-generation sequencing data, together with the Hi-C data, into the juicer software to obtain a non-redundant txt file.
2. Input the genome assembled from second-generation or third-generation sequencing data and the non-redundant txt file into the 3d-dna software to obtain the original hic file and assembly file.
3. Input the original hic file and the assembly file into the juicebox software, adjust manually, and preliminarily divide the interaction map into 5-10 clusters according to the genome size to obtain the adjusted assembly file.
4. Generate 5-10 independent assembly files from the sequences corresponding to the sequence numbers of the 5-10 clusters in the adjusted assembly file, and renumber the contents of each independent assembly file so that the sequence numbers start from 1.
5. Use the juicebox_tools software to generate a corresponding hic file for each of the 5-10 independent assembly files, then import each hic file with its assembly file into the juicebox software for independent adjustment, producing the adjusted independent assembly files.
6. Combine the adjusted independent assembly files into a total assembly file, with the sequence numbers renumbered in increasing order from 1.
7. Use the 3d-dna software on the total assembly file to generate the corresponding hic file and hic heatmap, and to generate the final chromosome-level genome file.
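In the steps above the division into clusters is made manually in juicebox. Purely as an illustration of what "according to the genome size" can mean, the following sketch proposes cluster boundaries automatically by grouping consecutive scaffolds into groups of roughly equal total length. The greedy rule and the name `split_into_clusters` are assumptions of this sketch and are not taken from the application.

```python
def split_into_clusters(lengths, k):
    """Partition an ordered list of lengths into at most k contiguous groups
    of roughly equal total length. Returns a list of index lists."""
    if k < 1:
        raise ValueError("k must be at least 1")
    total = sum(lengths)
    clusters, current, running = [], [], 0
    for i, length in enumerate(lengths):
        current.append(i)
        running += length
        # Close the group once the running total reaches its share of the
        # genome, unless this is already the last allowed group.
        boundary = total * (len(clusters) + 1) / k
        if running >= boundary and len(clusters) < k - 1:
            clusters.append(current)
            current = []
    if current:
        clusters.append(current)
    return clusters


# Ten hypothetical scaffold lengths in arbitrary units, split into 5 clusters.
scaffold_lengths = [9, 8, 7, 6, 5, 5, 4, 3, 2, 1]
clusters = split_into_clusters(scaffold_lengths, 5)
```

Such a proposal would only be a starting point; the boundaries still have to respect the interaction map, so that the contigs of one chromosome are not separated into different clusters.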
Example 3
1. Input the genome contig file of a genome with a size of 50G, together with the hic data, into the juicer software to obtain a non-redundant txt file.
2. Input the genome assembled from second-generation or third-generation sequencing data and the non-redundant txt file into the 3d-dna software to obtain the original hic file and assembly file.
3. Input the original hic file and the assembly file into the juicebox software, adjust manually, and preliminarily divide the interaction map into 10 clusters according to the genome size and the number of chromosomes to obtain the adjusted assembly file.
4. Generate 10 independent assembly files from the sequences corresponding to the sequence numbers of the 10 clusters in the adjusted assembly file, and renumber the contents of each independent assembly file so that the sequence numbers start from 1.
5. Use the juicebox_tools software to generate a corresponding hic file for each of the 10 independent assembly files, then import each of the 10 hic files with its assembly file into the juicebox software for independent adjustment, producing the adjusted independent assembly files.
6. Combine the adjusted independent assembly files into a total assembly file, with the sequence numbers renumbered in increasing order from 1.
7. Use the 3d-dna software on the total assembly file to generate the corresponding hic file and hic heatmap, and to generate the final chromosome-level 50G genome file.
Table 1: Comparison of 50G genome hic adjustment results
[Table 1 appears as an image in the original publication and is not reproduced here.]
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a computer-readable storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Corresponding to the above method, the present application also provides a device for Hi-C mounting of a genome larger than 10G, which is used for implementing the above method; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the device described in the embodiments below is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
This is further illustrated below in connection with alternative embodiments.
Example 4
This example provides an apparatus for genome Hi-C mounting of greater than 10G, the apparatus comprising: an acquisition module configured to acquire an original hic file and an original assembly file of genome Hi-C data greater than 10G; the splitting module is used for splitting the original assembly file into a plurality of independent assembly files and obtaining a plurality of corresponding independent hic files according to the independent assembly files; an adjustment module configured to obtain a plurality of adjusted independent assembly files from the plurality of independent assembly files and the plurality of independent hic files, respectively; and the merging module is configured to merge the adjusted independent assembly files to obtain a genome assembled at the chromosome level.
Optionally, the acquisition module includes: a first conversion module configured to convert the contig-level genome larger than 10G and the Hi-C data into a non-redundant txt file recognizable by the 3d-dna software; and a second conversion module configured to input the non-redundant txt file recognizable by 3d-dna, together with the contig-level genome, into the 3d-dna software to obtain the original hic file and the original assembly file.
Optionally, the splitting module comprises: an interaction adjusting unit configured to input the original hic file and the original assembly file into the juicebox software to manually adjust the interaction relationship; a cluster dividing unit configured to divide the manually adjusted interaction map into a plurality of clusters according to the size of the contig-level genome and the number of chromosomes to obtain a plurality of adjusted assembly files; a first renumbering unit configured to renumber the contigs in each adjusted assembly file from 1 in sequence-number order to obtain each independent assembly file; and a first conversion unit configured to input the plurality of independent assembly files into the juicebox_tools software to generate the corresponding independent hic files.
Optionally, the adjustment module is a juicebox module.
Optionally, the merging module includes: the file merging unit is used for merging the adjusted independent assembly files into a total assembly file; the second renumbering unit is set to sequentially number the contigs from 1 in descending order according to the adjusted total length of the contigs in the independently assembled file; a second conversion unit configured to input the total assembly file into the 3d-dna software, thereby obtaining a genome at a chromosome level.
Example 5
This embodiment provides a computer-readable storage medium comprising a stored program, wherein, when the program runs, the device on which the computer-readable storage medium is located is controlled to perform any one of the above methods for Hi-C mounting of a genome larger than 10G.
This embodiment also provides a processor for running a program, wherein the program, when run, performs any one of the above methods for Hi-C mounting of a genome larger than 10G.
From the above description, it can be seen that, for a genome with a genome size larger than 10G, by splitting the assembly file into 5-10 independent assembly files, the delay of the adjustment of the juicebox software is significantly reduced, so that the crash of the juicebox software is avoided, the Hi-C interaction resolution is not reduced, and the accuracy of the adjustment is not sacrificed.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (8)

1. A method for genomic Hi-C mounting of greater than 10G, the method comprising:
acquiring an original hic file and an original assembly file of the Hi-C genome data larger than 10G;
splitting the original assembly file into a plurality of independent assembly files, and obtaining a plurality of corresponding independent hic files according to the independent assembly files;
obtaining a plurality of adjusted independent assembly files according to the plurality of independent assembly files and the plurality of independent hic files;
merging a plurality of the adjusted independent assembly files to obtain a genome assembled at the chromosome level;
wherein, obtaining the original hic file and the original assembly file of the genome Hi-C data larger than 10G comprises:
converting genome and Hi-C data at contig levels greater than 10G into non-redundant txt files recognizable by 3d-dna software;
inputting the non-redundant txt file which can be identified by the 3d-dna and the genome at the level of the contig into the 3d-dna software together to obtain the original hic file and an original assembly file;
splitting the original assembly file into a plurality of independent assembly files, and obtaining a plurality of corresponding independent hic files according to the plurality of independent assembly files comprises:
inputting the original hic file and the original assembly file into juicebox software to manually adjust the interaction relationship;
dividing the manually adjusted interaction graph into a plurality of clusters according to the size of the genome at the contig level and the number of chromosomes to obtain a plurality of adjusted assembly files;
renumbering each contig from 1 according to the sequence numbering order in each adjusted assembly file to obtain each independent assembly file;
inputting the independent assembly files into juicebox_tools software to respectively generate a plurality of corresponding independent hic files.
2. The method according to claim 1, wherein obtaining a plurality of adjusted independent assembly files from a plurality of the independent assembly files and a plurality of the independent hic files, respectively, comprises:
and importing the independent hic files and the corresponding independent assembly files into juicebox software for independent interaction adjustment to obtain a plurality of adjusted independent assembly files.
3. The method of claim 1, wherein merging the plurality of adjusted independent assembly files to obtain a genome assembled at a chromosome level comprises:
combining a plurality of adjusted independent assembly files into a total assembly file;
numbering each contig in sequence from 1 according to the total length of the contigs in the adjusted independent assembly file from large to small;
inputting the total assembly file into 3d-dna software, thereby obtaining the genome at the chromosome level.
4. An apparatus for genomic Hi-C mounting of greater than 10G, comprising:
an acquisition module configured to acquire an original hic file and an original assembly file of genome Hi-C data greater than 10G;
the splitting module is used for splitting the original assembly file into a plurality of independent assembly files and obtaining a plurality of corresponding independent hic files according to the independent assembly files;
an adjustment module configured to obtain a plurality of adjusted independent assembly files from the plurality of independent assembly files and the plurality of independent hic files, respectively;
a merging module configured to merge a plurality of the adjusted independent assembly files to obtain a genome assembled at a chromosome level;
wherein the acquisition module comprises:
a first conversion module configured to convert genome and Hi-C data at contig levels greater than 10G into a non-redundant txt file recognizable by 3d-dna software;
a second conversion module, configured to input the non-redundant txt file recognizable by the 3d-dna and the genome at the contig level into the 3d-dna software together to obtain the original hic file and an original assembly file;
wherein the splitting module comprises:
an interaction adjusting unit, which is set to input the original hic file and the original assembly file into a juicebox software to manually adjust the interaction relation;
a cluster dividing unit configured to divide the manually adjusted interaction graph into a plurality of clusters according to the size of the genome at the contig level and the number of chromosomes to obtain a plurality of adjusted assembly files;
a first renumbering unit configured to renumber each of the contigs from 1 again in a sequence numbering order within each of the adjusted assembly files to obtain each of the independent assembly files;
and a first conversion unit configured to input the independent assembly files into the juicebox_tools software to respectively generate a plurality of corresponding independent hic files.
5. The apparatus of claim 4, wherein the adjustment module is a juicebox module.
6. The apparatus of claim 4, wherein the merging module comprises:
the file merging unit is used for merging a plurality of adjusted independent assembly files into a total assembly file;
the second renumbering unit is set to sequentially number the contigs from 1 in descending order according to the total length of the contigs in the adjusted independent assembly file;
a second conversion unit configured to input the total assembly file into 3d-dna software, thereby obtaining the genome at the chromosome level.
7. A computer-readable storage medium, comprising a stored program, wherein when the program is run, a device on which the computer-readable storage medium is located is controlled to perform the method for genomic Hi-C mounting of greater than 10G according to any one of claims 1 to 3.
8. A processor, characterized in that the processor is configured to run a program, wherein the program, when running, performs the method for genomic Hi-C mounting of greater than 10G according to any one of claims 1 to 3.
CN202210463242.9A 2022-04-29 2022-04-29 Method and device for carrying Hi-C genome larger than 10G Active CN114566212B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210463242.9A CN114566212B (en) 2022-04-29 2022-04-29 Method and device for carrying Hi-C genome larger than 10G

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210463242.9A CN114566212B (en) 2022-04-29 2022-04-29 Method and device for carrying Hi-C genome larger than 10G

Publications (2)

Publication Number Publication Date
CN114566212A (en) 2022-05-31
CN114566212B (en) 2022-09-16

Family

ID=81720943

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210463242.9A Active CN114566212B (en) 2022-04-29 2022-04-29 Method and device for carrying Hi-C genome larger than 10G

Country Status (1)

Country Link
CN (1) CN114566212B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115579061B (en) * 2022-12-07 2023-04-07 北京诺禾致源科技股份有限公司 Method and device for analyzing genome hic

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112289382A (en) * 2020-10-28 2021-01-29 天津诺禾致源生物信息科技有限公司 Method and device for splitting polyploid genome homologous chromosome and application thereof
WO2021119550A1 (en) * 2019-12-13 2021-06-17 The Broad Institute, Inc. Method for determination of 3d genome architecture with base pair resolution and further uses thereof
CN113782101A (en) * 2021-11-12 2021-12-10 北京诺禾致源科技股份有限公司 Method and device for removing redundancy of high heterozygous diploid sequence assembly result and application of method and device
CN113918355A (en) * 2021-11-25 2022-01-11 天津诺禾致源生物信息科技有限公司 Genome assembly method and device, computer readable storage medium and processor

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2960818A1 (en) * 2014-06-24 2015-12-30 Institut Pasteur Method, device, and computer program for assembling pieces of chromosomes from one or several organisms
US12027236B2 (en) * 2018-01-14 2024-07-02 The Broad Institute, Inc. Linear genome assembly from three dimensional genome structure
CN109326323B (en) * 2018-09-13 2022-03-18 北京百迈客生物科技有限公司 Genome assembly method and device
CN112820354B (en) * 2021-02-25 2022-07-22 深圳华大基因科技服务有限公司 Method and device for assembling diploid and storage medium
CN113808668B (en) * 2021-11-18 2022-02-18 北京诺禾致源科技股份有限公司 Method and device for improving genome assembly integrity and application thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Genome Assembly Cookbook; THE CENTER FOR GENOME ARCHITECTURE, Baylor College of Medicine &; 《http://aidenlab.org/assembly/manual_180322.pdf》; 20210329; pages 3-40 *

Also Published As

Publication number Publication date
CN114566212A (en) 2022-05-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant