CN114566212A - Method and device for Hi-C mounting of genomes larger than 10G
- Publication number
- CN114566212A CN202210463242.9A
- Authority
- CN
- China
- Prior art keywords
- files
- file
- assembly
- independent
- hic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/30—Data warehousing; Computing architectures
Abstract
The application provides a method and a device for Hi-C mounting of genomes larger than 10G. The method comprises: acquiring an original hic file and an original assembly file from the Hi-C data of a genome larger than 10G; splitting the original assembly file into a plurality of independent assembly files, and obtaining a plurality of corresponding independent hic files according to the independent assembly files; obtaining a plurality of adjusted independent assembly files according to the plurality of independent assembly files and the plurality of independent hic files; and merging the adjusted independent assembly files to obtain the genome assembled at the chromosome level. The method effectively reduces the software delay when the hic file of a large genome is manually adjusted, without sacrificing the accuracy of the hic interactions.
Description
Technical Field
The application relates to the field of genome assembly, and in particular to a method and a device for Hi-C mounting of genomes larger than 10G.
Background
Genome assembly is generally divided into second-generation sequencing data assembly and third-generation sequencing data assembly. The common assembly software for second-generation sequencing data is SOAPdenovo, which combines small-fragment and large-fragment data to produce a scaffold-level genome; the assembly software commonly used for third-generation sequencing data is canu or falcon, which produces a contig-level genome. Neither of these two assembly approaches can assemble the genome to the chromosome level.
Hi-C (High-throughput chromosome conformation capture) is a high-throughput chromosome conformation capture technology. It relies on the principle that the interaction strength within a chromosome is far greater than the interaction strength between chromosomes. Tissue is cross-linked and fixed with formaldehyde, the genome is digested with a specific restriction enzyme, the ends are biotin-labeled and repaired, and the fragments are ligated and then sheared; the biotin-labeled fragments are captured with magnetic beads for high-throughput sequencing. The sequencing data are combined with the contig-level or scaffold-level genome and mounted (that is, scaffolded) using the 3d-dna software, which generates a hic file and an assembly file; the chromosome-level genome is finally obtained after manual adjustment in juicebox.
The hic file is a binary file generated by the 3d-dna software; its content is the Hi-C interaction matrix. The default resolutions of the hic file generated by 3d-dna are "2500000, 1000000, 500000, 250000, 100000, 50000, 25000, 10000, 5000, 1000".
The assembly file is a text file corresponding to the hic file; it records the length and position information of each contig or scaffold. The assembly file is divided into two parts. The first part consists of lines beginning with ">" and has 3 columns: the first column is the sequence name, prefixed with ">"; the second column is the sequence number, starting at 1; the third column is the length of the sequence. The second part consists purely of numbers, and the number of columns per row is not fixed: each row represents a group of contigs or scaffolds clustered into one large scaffold, the numbers in the row correspond to the sequence numbers of the first part, and a negative sign indicates that the sequence is reverse-complemented.
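The two-part layout described above can be read with a few lines of Python. This is only a minimal sketch under the assumption that columns are whitespace-separated; the function name `parse_assembly` and the contig names in the example are hypothetical and are not taken from the application or from 3d-dna.

```python
def parse_assembly(text):
    """Parse the two-part assembly layout described in the text.

    Returns (contigs, scaffolds):
      contigs   maps sequence number -> (name, length)
      scaffolds is a list of rows; each row is a list of signed
                sequence numbers (negative = reverse complement).
    """
    contigs = {}
    scaffolds = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # First part: ">name number length"
            name, number, length = line[1:].rsplit(None, 2)
            contigs[int(number)] = (name, int(length))
        else:
            # Second part: one large scaffold per row
            scaffolds.append([int(token) for token in line.split()])
    return contigs, scaffolds


# Hypothetical three-contig example
example = """>ctg_a 1 500
>ctg_b 2 300
>ctg_c 3 200
1 -3
2
"""
contigs, scaffolds = parse_assembly(example)
```

Here the first scaffold row joins contig 1 with the reverse complement of contig 3, and contig 2 forms a scaffold on its own.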
Juicebox is software written in Java for visualizing hic data. After the hic file is imported into juicebox, the hic heat map can be broken, flipped or shifted according to the interaction strength; the adjusted hic file yields a corresponding adjusted assembly file, from which the mounted chromosome-level genome is obtained.
However, the inventors found in their studies that for a genome smaller than 10G the hic file is small, the delay of manual adjustment in juicebox is small, and the chromosome-level genome can be adjusted quickly; for a genome larger than 10G, the delay of each manual adjustment increases markedly, and the adjustment process often causes the juicebox software to crash. Reducing the interaction resolution can reduce the adjustment delay, but it sacrifices the accuracy of the adjustment.
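To give a rough sense of scale, the sketch below counts how many bins lie along one axis of the interaction matrix at each default resolution. It assumes that "G" means 10^9 base pairs and that the whole genome is laid out along one axis; the helper name `bins_per_axis` is hypothetical, and the sketch makes no claim about how juicebox actually stores or renders the matrix.

```python
# Default 3d-dna resolutions quoted in the text, in base pairs per bin
RESOLUTIONS_BP = [2500000, 1000000, 500000, 250000, 100000,
                  50000, 25000, 10000, 5000, 1000]


def bins_per_axis(genome_bp, resolution_bp):
    """Number of bins along one matrix axis (ceiling division)."""
    return -(-genome_bp // resolution_bp)


# Assumption: 1G = 10**9 bp
for genome_g in (1, 10, 50):
    genome_bp = genome_g * 10**9
    coarsest = bins_per_axis(genome_bp, RESOLUTIONS_BP[0])
    finest = bins_per_axis(genome_bp, RESOLUTIONS_BP[-1])
    print(f"{genome_g}G genome: {coarsest:,} bins at the coarsest resolution, "
          f"{finest:,} bins at the finest")
```

Because the matrix is two-dimensional, the number of possible cells grows with the square of the per-axis bin count, which is consistent with the observation that coarser resolution lowers the delay at the cost of accuracy.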
Therefore, there is still a need to improve the Hi-C mounting method used when assembling genomes larger than 10G, so as to reduce the software delay of manually adjusting the hic file of a large genome.
Disclosure of Invention
The main purpose of the present application is to provide a method and a device for Hi-C mounting of genomes larger than 10G, so as to reduce the software delay when the hic file of a large genome is manually adjusted.
To achieve the above object, according to one aspect of the present application, there is provided a method for Hi-C mounting of genomes larger than 10G, the method comprising: acquiring an original hic file and an original assembly file from the Hi-C data of a genome larger than 10G; splitting the original assembly file into a plurality of independent assembly files, and obtaining a plurality of corresponding independent hic files according to the independent assembly files; obtaining a plurality of adjusted independent assembly files according to the plurality of independent assembly files and the plurality of independent hic files; and merging the adjusted independent assembly files to obtain the genome assembled at the chromosome level.
Further, acquiring the original hic file and the original assembly file from the Hi-C data of a genome larger than 10G includes: converting the contig-level genome larger than 10G and its Hi-C data into a non-redundant txt file recognizable by the 3d-dna software; and inputting that non-redundant txt file, together with the contig-level genome, into the 3d-dna software to obtain the original hic file and the original assembly file.
Further, splitting the original assembly file into a plurality of independent assembly files, and obtaining a corresponding plurality of independent hic files from the plurality of independent assembly files, includes: inputting the original hic file and the original assembly file into the juicebox software to manually adjust the interaction relationship; dividing the manually adjusted interaction graph into a plurality of clusters according to the size of the contig-level genome and the number of chromosomes, to obtain a plurality of adjusted assembly files; renumbering the contigs in each adjusted assembly file from 1, following the order of their sequence numbers, to obtain the independent assembly files; and inputting the independent assembly files into the juicebox_tools software to generate the corresponding independent hic files.
Further, obtaining a plurality of adjusted independent assembly files from the plurality of independent assembly files and the plurality of independent hic files includes: importing each independent hic file and its corresponding independent assembly file into the juicebox software for independent interaction adjustment, to obtain a plurality of adjusted independent assembly files.
Further, merging the plurality of adjusted independent assembly files to obtain a genome assembled at the chromosome level comprises: combining the plurality of adjusted independent assembly files into a total assembly file; numbering all contigs sequentially from 1, in descending order of the total contig length in the adjusted independent assembly files; and importing the total assembly file into the 3d-dna software to obtain the chromosome-level genome.
According to a second aspect of the present application, there is provided a device for genomic Hi-C mounting of greater than 10G, the device comprising: an acquisition module configured to acquire an original hic file and an original assembly file of genome Hi-C data greater than 10G; the splitting module is used for splitting the original assembly file into a plurality of independent assembly files and obtaining a plurality of corresponding independent hic files according to the independent assembly files; an adjustment module configured to obtain a plurality of adjusted independent assembly files from the plurality of independent assembly files and the plurality of independent hic files, respectively; and the merging module is used for merging the adjusted independent assembly files to obtain the genome assembled at the chromosome level.
Further, the acquisition module includes: a first conversion module configured to convert genome and Hi-C data at contig levels greater than 10G into a non-redundant txt file recognizable by 3d-dna software; and the second conversion module is arranged to input the 3d-dna recognizable redundancy-free txt file and the genome at the level of the contig into the 3d-dna software together to obtain an original hic file and an original assembly file.
Further, the splitting module comprises: an interaction adjusting unit configured to input the original hic file and the original assembly file into the juicebox software to manually adjust the interaction relationship; a cluster dividing unit configured to divide the manually adjusted interaction graph into a plurality of clusters according to the size of the contig-level genome and the number of chromosomes, to obtain a plurality of adjusted assembly files; a first renumbering unit configured to renumber the contigs in each adjusted assembly file from 1, following the order of their sequence numbers, to obtain the independent assembly files; and a first conversion unit configured to input the plurality of independent assembly files into the juicebox_tools software to generate the corresponding independent hic files.
Further, the adjusting module is a juicebox module.
Further, the merging module includes: the file merging unit is used for merging the adjusted independent assembly files into a total assembly file; the second renumbering unit is set to sequentially number the contigs from 1 in descending order according to the adjusted total length of the contigs in the independently assembled file; a second conversion unit configured to input the total assembly file into the 3d-dna software, thereby obtaining a genome at a chromosome level.
According to a third aspect of the present application, there is provided a computer readable storage medium comprising a stored program, wherein the program when executed controls an apparatus on which the computer readable storage medium is located to perform any of the above methods for genome Hi-C mounting greater than 10G.
According to a fourth aspect of the present application, there is provided a processor for executing a program, wherein the program when executed performs any of the above methods for genome Hi-C mounting greater than 10G.
By applying this technical scheme, the large genome is split into a plurality of small genomes smaller than 10G, each of which is adjusted independently, and the adjusted genomes are then merged again to finally generate the chromosome-level genome of the large genome. The method effectively reduces the software delay of manually adjusting the hic file of a large genome without sacrificing the accuracy of the hic interactions.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application. In the drawings:
FIG. 1 shows a schematic flow chart of the method for genome Hi-C mounting of more than 10G in a preferred embodiment of the present application.
Detailed Description
It should be noted that, in the present application, the embodiments and features of the embodiments may be combined with each other without conflict. The present application will be described in detail with reference to examples.
As mentioned in the background, the inventors found in their studies that for a genome smaller than 10G the hic file is small, the delay of manual adjustment in juicebox is small, and the chromosome-level genome can be adjusted quickly; for a genome larger than 10G, the delay of each manual adjustment increases markedly, and the adjustment process often causes the juicebox software to crash. Reducing the interaction resolution can reduce the adjustment delay, but it sacrifices the accuracy of the adjustment.
To address this problem, the inventors researched and analyzed the existing Hi-C mounting method and found that splitting the large genome into a plurality of small genomes smaller than 10G, adjusting each independently, and then merging the adjusted genomes again finally yields the chromosome-level genome of the large genome. This not only effectively reduces the software delay when the hic file of a large genome is manually adjusted, but also does not sacrifice the accuracy of the hic interactions. On this basis, the inventors propose the improved scheme of the present application.
Example 1
In this example, a method for genomic Hi-C mounting of greater than 10G is presented, the method comprising:
step S101, acquiring an original hic file and an original assembly file of the Hi-C data of the genome larger than 10G;
step S102, splitting an original assembly file into a plurality of independent assembly files, and obtaining a plurality of corresponding independent hic files according to the plurality of independent assembly files;
step S103, respectively obtaining a plurality of adjusted independent assembly files according to the plurality of independent assembly files and the plurality of independent hic files;
and step S104, merging the adjusted independent assembly files to obtain a genome assembled at the chromosome level.
In this example, a large genome larger than 10G is split into a plurality of small genomes smaller than 10G, the corresponding HiC data are individually adjusted, and then the adjusted genomes are merged again, so as to finally generate the chromosome level genome of the large genome. The mounting method can effectively reduce software delay when the large genome hic file is manually adjusted, and does not sacrifice hic interaction accuracy.
In a preferred embodiment, the step S101 includes: converting the contig-level genome larger than 10G and its Hi-C data into a non-redundant txt file recognizable by the 3d-dna software; and inputting that non-redundant txt file, together with the contig-level genome, into the 3d-dna software to obtain the original hic file and the original assembly file.
The Hi-C data are first converted into a non-redundant txt file recognizable by the 3d-dna software, for example using the juicer software. The 3d-dna software then converts the contig-level genome file (fasta format) obtained from the original assembly, together with that non-redundant txt file, into the original hic file and the original assembly file, so that the subsequent splitting, individual adjustment and other operations can be carried out.
In a preferred embodiment, the step S102 includes: inputting the original hic file and the original assembly file into the juicebox software to manually adjust the interaction relationship; dividing the manually adjusted interaction graph into a plurality of clusters according to the size of the contig-level genome and the number of chromosomes, to obtain a plurality of adjusted assembly files; renumbering the contigs in each adjusted assembly file from 1, following the order of their sequence numbers, to obtain the independent assembly files; and inputting the independent assembly files into the juicebox_tools software to generate the corresponding independent hic files.
The juicebox software is written in Java and visualizes hic data. After the hic file is imported into juicebox, the hic heat map can be broken, flipped or shifted according to the interaction strength. In this preferred embodiment, the interaction relationships are first adjusted manually, and the adjusted hic interaction graph is then divided into a plurality of clusters to obtain a plurality of adjusted assembly files (each cluster may contain, for example, one or more chromosomes). The contigs in each adjusted assembly file are renumbered from 1 so that each of the resulting small genomes can be adjusted and assembled independently.
The following example illustrates this. Suppose the large genome is 50G in size and the contigs in the original hic file and original assembly file carry sequence numbers 1 to 100. Dividing them into 5 clusters according to size gives 5 independent assembly files, containing, for instance, the contigs numbered 1-20, 21-40, 41-60, 61-80 and 81-100 respectively. Within each independent file, such as the one holding sequence numbers 21-40, the contigs are renumbered 1-20, so the sequence numbers in all 5 independent files run from 1 to 20.
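The renumbering in this example can be sketched in Python as follows. This is an illustrative sketch only: the function name `split_and_renumber`, the in-memory data layout, and the uniform contig length of 1000 are assumptions made for the example, not details given in the application.

```python
def split_and_renumber(contigs, clusters):
    """Build one independent assembly per cluster, renumbered from 1.

    contigs  maps original sequence number -> (name, length)
    clusters is a list of clusters; each cluster is a list of scaffold
             rows, and each row is a list of signed sequence numbers.
    Returns a list of (header, rows) pairs, where header is a list of
    (name, new_number, length) tuples.
    """
    parts = []
    for rows in clusters:
        # Keep the original ordering of sequence numbers inside the cluster
        old_numbers = sorted({abs(n) for row in rows for n in row})
        mapping = {old: new for new, old in enumerate(old_numbers, start=1)}
        header = [(contigs[old][0], mapping[old], contigs[old][1])
                  for old in old_numbers]
        # Preserve orientation: a negative number stays negative
        new_rows = [[mapping[abs(n)] if n > 0 else -mapping[abs(n)] for n in row]
                    for row in rows]
        parts.append((header, new_rows))
    return parts


# 100 contigs split into 5 clusters of 20, as in the example above
contigs = {i: (f"ctg{i}", 1000) for i in range(1, 101)}
clusters = [[list(range(start, start + 20))] for start in range(1, 101, 20)]
parts = split_and_renumber(contigs, clusters)
```

In `parts[1]`, the contig originally numbered 21 now carries number 1 and the contig originally numbered 40 carries number 20, while their names and lengths are unchanged.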
In a preferred embodiment, the step S103 includes: and importing the independent hic files and the corresponding independent assembly files into the juicebox software for independent interaction adjustment to obtain a plurality of adjusted independent assembly files.
The specific operation of performing individual interactive adjustment on the divided independent assembly files is the same as the hic mounting method of genome of less than 10G in the prior art. Namely, each small genome is subjected to interaction adjustment to obtain each adjusted independent assembly file.
In a preferred embodiment, the step S104 includes: combining the plurality of adjusted independent assembly files into a total assembly file; numbering all contigs sequentially from 1, in descending order of the total contig length in the adjusted independent assembly files; and importing the total assembly file into the 3d-dna software to obtain the chromosome-level genome.
In this preferred embodiment, the plurality of adjusted independent assembly files are merged and then numbered from 1 in descending order of total contig length, so that the chromosome-level sequences are output with distinct numbers (that is, numbered and output in descending order of assembled chromosome length). Numbering by descending total contig length is a conventional output order but not the only possible one; other orders, such as random output or output from shortest to longest, may also be used.
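One possible reading of this merge step is sketched below in Python: scaffold rows from all adjusted independent assembly files are pooled, sorted by total length from largest to smallest, and the contigs are then given new sequence numbers starting at 1 in that order. The function name `merge_assemblies`, the data layout, and the assumption that each contig appears in exactly one row are illustrative choices, not specifics from the application.

```python
def merge_assemblies(parts):
    """Merge independent assemblies and renumber contigs from 1.

    parts is a list of (header, rows) pairs; header is a list of
    (name, number, length) tuples and rows are lists of signed numbers
    that refer to the numbering local to that part.
    Returns (header, rows) for the total assembly, with scaffolds in
    descending order of total length.
    """
    scaffolds = []
    for header, rows in parts:
        by_number = {number: (name, length) for name, number, length in header}
        for row in rows:
            # Resolve local numbers to (name, length, is_reversed)
            scaffolds.append([(by_number[abs(n)][0], by_number[abs(n)][1], n < 0)
                              for n in row])

    # Largest scaffold first
    scaffolds.sort(key=lambda members: sum(length for _, length, _ in members),
                   reverse=True)

    total_header, total_rows, next_number = [], [], 1
    for members in scaffolds:
        row = []
        for name, length, is_reversed in members:
            total_header.append((name, next_number, length))
            row.append(-next_number if is_reversed else next_number)
            next_number += 1
        total_rows.append(row)
    return total_header, total_rows


# Two hypothetical adjusted parts, each numbered from 1
part_a = ([("a", 1, 100), ("b", 2, 50)], [[1, -2]])
part_b = ([("c", 1, 400), ("d", 2, 10)], [[2], [-1]])
total_header, total_rows = merge_assemblies([part_a, part_b])
```

Resolving each part's local numbers back to contig names before renumbering avoids collisions, since every independent file reuses the numbers 1, 2, 3 and so on.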
In another preferred embodiment, the 3d-dna software may also be used to generate, for the merged total assembly file, the corresponding hic file and the hic heat map used for visual display.
The benefits of the present application will be further illustrated below in conjunction with other embodiments.
Example 2
As shown in FIG. 1, the specific method for adjusting the hic mounting of a large genome (larger than 10G) in this example is as follows:
1. Input the genome assembled from second-generation or third-generation sequencing, together with the Hi-C data, into the juicer software to obtain a non-redundant txt file.
2. Input the genome assembled from second-generation or third-generation sequencing and the non-redundant txt file into the 3d-dna software to obtain the original hic file and assembly file.
3. Input the original hic file and assembly file into the juicebox software, adjust manually, and preliminarily divide the interaction graph into 5-10 clusters according to the genome size to obtain the adjusted assembly file.
4. From the adjusted assembly file, generate 5-10 independent assembly files, one per cluster, containing the sequences that correspond to that cluster's sequence numbers; renumber the contents of each independent assembly file so that its sequence numbers start from 1.
5. Use the juicebox_tools software to generate a corresponding hic file for each of the 5-10 independent assembly files, then import each hic file with its assembly file into the juicebox software and adjust it independently to generate the adjusted independent assembly files.
6. Merge the adjusted independent assembly files into a total assembly file, with the sequence numbers renumbered to increase from 1.
7. Use the 3d-dna software on the total assembly file to generate the corresponding hic file, the hic heat map, and the final chromosome-level genome file.
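The application does not spell out a rule for dividing the interaction graph into clusters beyond "according to the genome size". Purely as an illustration, the Python sketch below groups consecutive scaffolds greedily so that each cluster stays under a size limit. The function name `group_scaffolds`, the greedy rule, the 10G limit, and the scaffold sizes are all assumptions for the example; a single scaffold larger than the limit would simply end up in a cluster of its own.

```python
def group_scaffolds(scaffold_lengths, max_cluster_bp):
    """Greedily group consecutive scaffolds into clusters.

    scaffold_lengths lists the total length of each scaffold row, in the
    order the rows appear in the adjusted assembly file.
    Returns a list of clusters, each a list of row indices.
    """
    clusters, current, current_bp = [], [], 0
    for index, length in enumerate(scaffold_lengths):
        # Close the current cluster if adding this scaffold would exceed the limit
        if current and current_bp + length > max_cluster_bp:
            clusters.append(current)
            current, current_bp = [], 0
        current.append(index)
        current_bp += length
    if current:
        clusters.append(current)
    return clusters


# Hypothetical scaffold sizes, assuming 1G = 10**9 bp
G = 10**9
lengths = [6 * G, 5 * G, 4 * G, 3 * G, 9 * G, 2 * G]
clusters = group_scaffolds(lengths, max_cluster_bp=10 * G)
```

In practice the division is made by the operator in juicebox while looking at the interaction graph, so a rule like this could at most suggest starting boundaries.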
Example 3
1. Input the contig file of a genome 50G in size, together with the Hi-C data, into the juicer software to obtain a non-redundant txt file.
2. Input the genome assembled from second-generation or third-generation sequencing and the non-redundant txt file into the 3d-dna software to obtain the original hic file and assembly file.
3. Input the original hic file and assembly file into the juicebox software, adjust manually, and preliminarily divide the interaction graph into 10 clusters according to the genome size and the number of chromosomes to obtain the adjusted assembly file.
4. From the adjusted assembly file, generate 10 independent assembly files, one per cluster, containing the sequences that correspond to that cluster's sequence numbers; renumber the contents of each independent assembly file so that its sequence numbers start from 1.
5. Use the juicebox_tools software to generate a corresponding hic file for each of the 10 independent assembly files, then import each of the 10 hic files with its assembly file into the juicebox software and adjust it independently to generate the adjusted independent assembly files.
6. Merge the adjusted independent assembly files into a total assembly file, with the sequence numbers renumbered to increase from 1.
7. Use the 3d-dna software on the total assembly file to generate the corresponding hic file, the hic heat map, and the final chromosome-level 50G genome file.
Table 1: comparison of 50G genome hic adjustment results
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a computer-readable storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Corresponding to the above manner, the present application also provides a device for genome Hi-C mounting greater than 10G, which is used for implementing the above method for genome Hi-C mounting greater than 10G, and the description of which has been already made is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
This is further illustrated below in connection with alternative embodiments.
Example 4
This example provides an apparatus for genome Hi-C mounting of greater than 10G, the apparatus comprising: an acquisition module configured to acquire an original hic file and an original assembly file of genome Hi-C data greater than 10G; the splitting module is used for splitting the original assembly file into a plurality of independent assembly files and obtaining a plurality of corresponding independent hic files according to the independent assembly files; an adjustment module configured to obtain a plurality of adjusted independent assembly files from the plurality of independent assembly files and the plurality of independent hic files, respectively; and the merging module is used for merging the adjusted independent assembly files to obtain the genome assembled at the chromosome level.
Optionally, the acquisition module includes: a first conversion module configured to convert the contig-level genome larger than 10G and its Hi-C data into a non-redundant txt file recognizable by the 3d-dna software; and a second conversion module configured to input the non-redundant txt file recognizable by 3d-dna, together with the contig-level genome, into the 3d-dna software to obtain the original hic file and the original assembly file.
Optionally, the splitting module comprises: an interaction adjusting unit configured to input the original hic file and the original assembly file into the juicebox software to manually adjust the interaction relationship; a cluster dividing unit configured to divide the manually adjusted interaction graph into a plurality of clusters according to the size of the contig-level genome and the number of chromosomes, to obtain a plurality of adjusted assembly files; a first renumbering unit configured to renumber the contigs in each adjusted assembly file from 1, following the order of their sequence numbers, to obtain the independent assembly files; and a first conversion unit configured to input the plurality of independent assembly files into the juicebox_tools software to generate the corresponding independent hic files.
Optionally, the adjustment module is a juicebox module.
Optionally, the merging module includes: the file merging unit is used for merging the adjusted independent assembly files into a total assembly file; the second renumbering unit is set to sequentially number the contigs from 1 in descending order according to the adjusted total length of the contigs in the independently assembled file; a second conversion unit configured to input the total assembly file into the 3d-dna software, thereby obtaining a genome at a chromosome level.
Example 5
The embodiment provides a computer-readable storage medium, which includes a stored program, and when the program runs, the apparatus on which the computer-readable storage medium is located is controlled to perform any one of the above methods for mounting the genome Hi-C larger than 10G.
The embodiment also provides a processor, which is used for running the program, wherein the program runs to execute any one of the above methods for mounting the genome Hi-C with the size larger than 10G.
From the above description, it can be seen that, for a genome with a genome size larger than 10G, by splitting the assembly file into 5-10 independent assembly files, the delay of the adjustment of the juicebox software is significantly reduced, so that the crash of the juicebox software is avoided, the Hi-C interaction resolution is not reduced, and the accuracy of the adjustment is not sacrificed.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (12)
1. A method for genomic Hi-C mounting of greater than 10G, the method comprising:
acquiring an original hic file and an original assembly file of the Hi-C genome data larger than 10G;
splitting the original assembly file into a plurality of independent assembly files, and obtaining a plurality of corresponding independent hic files according to the independent assembly files;
obtaining a plurality of adjusted independent assembly files according to the plurality of independent assembly files and the plurality of independent hic files;
and merging a plurality of adjusted independent assembly files to obtain a genome assembled at the chromosome level.
2. The method of claim 1, wherein obtaining the original hic file and the original assembly file for genomic Hi-C data greater than 10G comprises:
converting the contig-level genome larger than 10G and the Hi-C data into a non-redundant txt file recognizable by the 3d-dna software;
and inputting the non-redundant txt file recognizable by the 3d-dna software, together with the contig-level genome, into the 3d-dna software to obtain the original hic file and the original assembly file.
3. The method of claim 2, wherein splitting the original assembled file into a plurality of independent assembled files, and obtaining a corresponding plurality of independent hic files from the plurality of independent assembled files comprises:
inputting the original hic file and the original assembly file into a juicebox software to manually adjust the interaction relationship;
dividing the manually adjusted interaction graph into a plurality of clusters according to the size and the number of chromosomes of the genome at the contig level to obtain a plurality of adjusted assembly files;
renumbering the contigs from 1 within each adjusted assembly file, following the sequence numbering order in that file, to obtain each independent assembly file;
inputting the independent assembly files into the juicebox_tools software to generate the plurality of corresponding independent hic files, respectively.
4. The method according to claim 1, wherein obtaining a plurality of adjusted independent assembly files from a plurality of the independent assembly files and a plurality of the independent hic files, respectively, comprises:
and importing the independent hic files and the corresponding independent assembly files into juicebox software for independent interaction adjustment to obtain a plurality of adjusted independent assembly files.
5. The method of claim 1, wherein merging the plurality of adjusted independent assembly files to obtain a genome assembled at a chromosome level comprises:
combining a plurality of adjusted independent assembly files into a total assembly file;
numbering the contigs sequentially from 1 in descending order of the total contig length in the adjusted independent assembly files;
inputting the total assembly file into the 3d-dna software, thereby obtaining the genome at the chromosome level.
6. An apparatus for genomic Hi-C mounting of greater than 10G, comprising:
an acquisition module configured to acquire an original hic file and an original assembly file of genome Hi-C data greater than 10G;
a splitting module configured to split the original assembly file into a plurality of independent assembly files and to obtain a plurality of corresponding independent hic files according to the independent assembly files;
an adjustment module configured to obtain a plurality of adjusted independent assembly files from the plurality of independent assembly files and the plurality of independent hic files, respectively;
a merging module configured to merge the plurality of adjusted independent assembly files to obtain a genome assembled at a chromosome level.
7. The apparatus of claim 6, wherein the acquisition module comprises:
a first conversion module configured to convert the contig-level genome larger than 10G and the Hi-C data into a non-redundant txt file recognizable by the 3d-dna software;
and a second conversion module configured to input the non-redundant txt file recognizable by the 3d-dna software, together with the contig-level genome, into the 3d-dna software to obtain the original hic file and the original assembly file.
8. The apparatus of claim 7, wherein the splitting module comprises:
an interaction adjusting unit configured to input the original hic file and the original assembly file into the juicebox software for manual adjustment of the interaction relationship;
a cluster dividing unit configured to divide the manually adjusted interaction graph into a plurality of clusters according to the size of the genome at the contig level and the number of chromosomes to obtain a plurality of adjusted assembly files;
a first renumbering unit configured to renumber each of the contigs from 1 again in a sequence numbering order within each of the adjusted assembly files to obtain each of the independent assembly files;
and a first conversion unit configured to input the independent assembly files into the juicebox_tools software to generate the plurality of corresponding independent hic files, respectively.
9. The apparatus of claim 6, wherein the adjustment module is a juicebox module.
10. The apparatus of claim 6, wherein the merging module comprises:
a file merging unit configured to merge the plurality of adjusted independent assembly files into a total assembly file;
a second renumbering unit configured to number the contigs sequentially from 1 in descending order of the total contig length in the adjusted independent assembly files;
a second conversion unit configured to input the total assembly file into 3d-dna software, thereby obtaining the genome at the chromosome level.
11. A computer-readable storage medium, comprising a stored program, wherein the program, when executed, controls the apparatus on which the computer-readable storage medium is located to perform the method for Hi-C mounting of a genome larger than 10G according to any one of claims 1 to 5.
12. A processor, characterized in that the processor is configured to run a program, wherein the program, when running, performs the method for Hi-C mounting of a genome larger than 10G according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210463242.9A CN114566212B (en) | 2022-04-29 | 2022-04-29 | Method and device for carrying Hi-C genome larger than 10G |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114566212A (en) | 2022-05-31 |
CN114566212B (en) | 2022-09-16 |
Family
ID=81720943
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210463242.9A Active CN114566212B (en) | 2022-04-29 | 2022-04-29 | Method and device for carrying Hi-C genome larger than 10G |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114566212B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115579061A (en) * | 2022-12-07 | 2023-01-06 | 北京诺禾致源科技股份有限公司 | Method and device for analyzing genome hic |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106471509A (en) * | 2014-06-24 | 2017-03-01 | 巴斯德研究所 | Method, device and computer program for assembling chromosomes derived from one or more organisms |
CN109326323A (en) * | 2018-09-13 | 2019-02-12 | 北京百迈客生物科技有限公司 | Genome assembly method and device |
US20190279740A1 (en) * | 2018-01-14 | 2019-09-12 | The Broad Institute, Inc. | Linear genome assembly from three dimensional genome structure |
CN112289382A (en) * | 2020-10-28 | 2021-01-29 | 天津诺禾致源生物信息科技有限公司 | Method and device for splitting polyploid genome homologous chromosome and application thereof |
CN112820354A (en) * | 2021-02-25 | 2021-05-18 | 深圳华大基因科技服务有限公司 | Method and device for assembling diploid and storage medium |
WO2021119550A1 (en) * | 2019-12-13 | 2021-06-17 | The Broad Institute, Inc. | Method for determination of 3d genome architecture with base pair resolution and further uses thereof |
CN113782101A (en) * | 2021-11-12 | 2021-12-10 | 北京诺禾致源科技股份有限公司 | Method and device for removing redundancy of high heterozygous diploid sequence assembly result and application of method and device |
CN113808668A (en) * | 2021-11-18 | 2021-12-17 | 北京诺禾致源科技股份有限公司 | Method and device for improving genome assembly integrity and application thereof |
CN113918355A (en) * | 2021-11-25 | 2022-01-11 | 天津诺禾致源生物信息科技有限公司 | Genome assembly method and device, computer readable storage medium and processor |
Non-Patent Citations (1)
Title |
---|
THE CENTER FOR GENOME ARCHITECTURE, BAYLOR COLLEGE OF MEDICINE &: "Genome Assembly Cookbook", 《HTTP://AIDENLAB.ORG/ASSEMBLY/MANUAL_180322.PDF》 * |
Also Published As
Publication number | Publication date |
---|---|
CN114566212B (en) | 2022-09-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||