CN115579061A - Method and device for analyzing genome hic - Google Patents

Method and device for analyzing genome hic

Publication number
CN115579061A
Authority
CN
China
Prior art keywords
file
assembly
files
genome
hic
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211561394.9A
Other languages
Chinese (zh)
Other versions
CN115579061B (en)
Inventor
王龙
赵勇
周勋
彭珍
曹斌斌
王静
陶琳娜
李萍
马策
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Novogene Technology Co ltd
Original Assignee
Beijing Novogene Technology Co ltd
Application filed by Beijing Novogene Technology Co ltd
Priority to CN202211561394.9A
Publication of CN115579061A
Application granted
Publication of CN115579061B
Legal status: Active

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20 Sequence assembly
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a method and a device for genome Hi-C analysis. The method comprises the following steps: S1, aligning the assembled genome with a reference genome; S2, aligning second-generation Hi-C sequencing data with the assembled genome; S3, clustering the resulting BAM file; S4, manually adjusting the original .hic and .assembly files of the clustering result; and S5, replacing the sequence names in the .assembly file adjusted in S4 with sequence names consistent with the reference genome; and/or filtering the .assembly file and extracting the composition and name of the corresponding chromosome; and/or sorting by chromosome length. With this technical solution, the sequence names of a newly assembled genome after Hi-C scaffolding can be modified to match the sequence names of the reference genome, the chromosomes can be sorted by length, and a separate heat map can be drawn for a specified chromosome.

Description

Method and device for analyzing genome hic
Technical Field
The invention relates to the field of genome assembly, in particular to a method and a device for analyzing genome hic.
Background
Genome assembly is generally divided into assembly of second-generation sequencing data and assembly of third-generation sequencing data. The assembly software commonly used for second-generation sequencing data is SOAPdenovo, which combines small-fragment and large-fragment data to produce a scaffold-level genome; the assembly software commonly used for third-generation sequencing data is Canu or FALCON, which produces a contig-level genome. However, neither approach can assemble a genome at the chromosome level.
Hi-C (high-throughput chromosome conformation capture) technology exploits the principle that the interaction strength within a chromosome is far greater than the interaction strength between chromosomes. Tissue is crosslinked and fixed with formaldehyde, the genome is digested with a specific restriction enzyme, the ends are biotin-labeled and repaired, and the fragments are ligated and then sheared again; the biotin-labeled fragments are captured with magnetic beads and subjected to high-throughput sequencing. The sequencing data are combined with the contig- or scaffold-level genome for scaffolding using ALLHiC or 3D-DNA software, the generated .hic and .assembly files are manually adjusted with Juicebox, and a chromosome-level genome is finally obtained.
The .hic file is a binary file generated by the 3D-DNA software; its content is the Hi-C interaction matrix, and the default resolutions of a .hic file generated by 3D-DNA are "2500000, 1000000, 500000, 250000, 100000, 50000, 25000, 10000, 5000, 1000". The .assembly file is a text file corresponding to the .hic file that records the length and position information of each contig or scaffold. The .assembly file has two parts. The first part consists of lines beginning with ">", each with 3 columns: the first column is the sequence name, prefixed with ">"; the second column is the sequence number, counted from 1; the third column is the sequence length. The second part consists of purely numeric lines with a variable number of columns; each line represents several contigs or scaffolds clustered into one large scaffold, the numbers in the line correspond to the sequence numbers in the first part, and a negative sign indicates that the sequence is reverse-complemented.
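For illustration only, the two-part .assembly layout described above could be read with a short Python sketch such as the following. The function names parse_assembly and scaffold_length and the example sequence names (ctg1 to ctg3) are hypothetical and are not part of the disclosed programs; the sketch assumes whitespace-separated columns and sequence names without spaces.

```python
from typing import Dict, List, Tuple


def parse_assembly(text: str) -> Tuple[Dict[int, Tuple[str, int]], List[List[int]]]:
    """Split .assembly text into its two parts.

    Returns (sequences, scaffolds):
      sequences maps sequence number -> (name, length)   (first part, ">" lines)
      scaffolds is a list of rows of signed sequence numbers (second part)
    """
    sequences: Dict[int, Tuple[str, int]] = {}
    scaffolds: List[List[int]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # First part: ">name number length"
            name, number, length = line[1:].split()
            sequences[int(number)] = (name, int(length))
        else:
            # Second part: one clustered scaffold per row; "-" means reverse complement
            scaffolds.append([int(token) for token in line.split()])
    return sequences, scaffolds


def scaffold_length(row: List[int], sequences: Dict[int, Tuple[str, int]]) -> int:
    """Total length of one row; the sign only encodes orientation."""
    return sum(sequences[abs(number)][1] for number in row)


# Hypothetical example content
example = """\
>ctg1 1 1000
>ctg2 2 500
>ctg3 3 250
1 -3
2
"""
sequences, scaffolds = parse_assembly(example)
lengths = [scaffold_length(row, sequences) for row in scaffolds]
```

In this example the first row joins ctg1 with the reverse complement of ctg3, so its length is 1250, and the second row contains only ctg2.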
Juicebox is software written in Java for visualizing and processing Hi-C data. After a .hic file is imported into Juicebox, parts of the Hi-C heat map can be split, inverted, and moved according to the interaction strength; the adjusted .hic file yields a correspondingly adjusted .assembly file, from which the scaffolded chromosome-level genome is obtained.
After a genome has been assembled to the contig or scaffold level by second- or third-generation sequencing and Hi-C scaffolding has been performed, the .hic file must be manually adjusted with Juicebox to correct misjoins and obtain a correctly joined chromosome-level genome. After scaffolding with ALLHiC, the chromosomes in the resulting genome file begin with Hic_asm instead of Chr, and their names are arbitrary and not sorted by length; the generated chromosome interaction heat map also contains all chromosomes, and no separate heat map is generated for each chromosome. When a reference genome exists and the sequence names of the newly assembled genome are to match those of the reference genome, they must be modified manually, which is tedious and error-prone.
Disclosure of Invention
The invention aims to provide a method and a device for analyzing genome hic, so as to optimize the method for analyzing the genome hic in the prior art.
To achieve the above object, according to one aspect of the present invention, there is provided a method of genome Hi-C analysis. The method comprises the following steps: S1, aligning the assembled genome with a reference genome to obtain a correspondence file of the sequence names of the assembled genome and the reference genome; S2, aligning second-generation Hi-C sequencing data with the assembled genome to obtain an aligned BAM file; S3, clustering the BAM file obtained in S2 to obtain the original .hic and .assembly files of the clustering result; S4, manually adjusting the original .hic and .assembly files obtained in S3 to obtain adjusted .hic and .assembly files; and S5, replacing the sequence names in the .assembly file adjusted in S4 with sequence names consistent with the reference genome according to the correspondence file obtained in S1, to obtain a name-replaced .assembly file; and/or filtering the .assembly file obtained in S4 and extracting the composition and name of the corresponding chromosome, to obtain an .assembly file for the specified chromosome name; and/or sorting the .assembly file obtained in S4 by chromosome length, to obtain a sorted .assembly file.
Further, the method further comprises S6: generating a genome fasta file from one or more of the name-replaced .assembly file, the .assembly file for the specified chromosome name, and the sorted .assembly file obtained in S5, and plotting with software to obtain the corresponding heat maps.
Further, in S1 the assembled genome is aligned with the reference genome using the software MUMmer or LASTZ; and/or in S2 the second-generation Hi-C sequencing data are aligned with the assembled genome using the software HiCUP; and/or in S3 the BAM file obtained in S2 is clustered using the software ALLHiC.
Further, in S3 the BAM file obtained in S2 is clustered using the software ALLHiC; and/or in S4 the original .hic and .assembly files of the clustering result are manually adjusted using the software Juicebox.
According to another aspect of the present invention, there is provided a device for genome Hi-C analysis. The device includes: a correspondence determining module configured to align the assembled genome with the reference genome to obtain a correspondence file of the sequence names of the assembled genome and the reference genome; an alignment module configured to align second-generation Hi-C sequencing data with the assembled genome to obtain an aligned BAM file; a clustering module configured to cluster the BAM file obtained in the alignment module to obtain the original .hic and .assembly files of the clustering result; an adjusting module configured to manually adjust the original .hic and .assembly files obtained in the clustering module to obtain adjusted .hic and .assembly files; and a result file output module, the result file output module comprising: a name replacement submodule configured to replace the sequence names in the .assembly file adjusted in the adjusting module with sequence names consistent with the reference genome according to the correspondence file obtained in the correspondence determining module, to obtain a name-replaced .assembly file; a filtering submodule configured to filter the .assembly file obtained in the adjusting module and extract the composition and name of the corresponding chromosome, to obtain an .assembly file for the specified chromosome name; and a sorting submodule configured to sort the .assembly file obtained in the adjusting module by chromosome length, to obtain a sorted .assembly file.
Further, the device further comprises a heat map generation module configured to generate a genome fasta file from one or more of the name-replaced .assembly file, the .assembly file for the specified chromosome name, and the sorted .assembly file obtained in the result file output module, and to plot with software to obtain the corresponding heat maps.
Further, the correspondence determining module uses the software MUMmer or LASTZ to align the assembled genome with the reference genome; and/or the alignment module uses the software HiCUP to align the second-generation Hi-C sequencing data with the assembled genome.
Further, the clustering module uses the software ALLHiC to cluster the BAM file; and/or the adjusting module uses the software Juicebox to manually adjust the original .hic and .assembly files of the clustering result.
According to yet another aspect of the present invention, a computer-readable storage medium is provided. The storage medium comprises a stored program, wherein, when the program runs, the device on which the storage medium is located is controlled to perform any of the above methods of genome Hi-C analysis.
According to another aspect of the invention, there is provided a processor for running a program, wherein the program, when running, performs any of the above methods of genome Hi-C analysis.
By applying the technical solution of the invention, the method and device for genome Hi-C analysis can modify the sequence names of a newly assembled genome after Hi-C scaffolding to match the sequence names of the reference genome; in addition, the chromosomes can be sorted by length, and a separate heat map can be drawn for a specified chromosome.
Drawings
The accompanying drawings, which form a part of this application, are included to provide a further understanding of the invention; the illustrative embodiments of the invention and their description serve to explain the invention and do not limit it. In the drawings:
FIG. 1 shows a flowchart of a method of genomic hic analysis according to an embodiment of the invention;
FIG. 2A and FIG. 2B show the heat maps of the reference genome and of the name-modified genome in example 1, respectively;
FIG. 3 shows the co-linearity of the reference genome with the modified genome in example 1;
FIG. 4 shows a single chromosome Chr01 heatmap in example 1; and
FIG. 5 shows a heatmap of the chromosomes in example 1 sorted by size.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
In order to optimize the methods for genomic hic analysis in the prior art, the following technical solutions are proposed in the present application.
According to an exemplary embodiment of the present invention, a method for genome Hi-C analysis is provided. The method comprises the following steps: S1, aligning the assembled genome with a reference genome to obtain a correspondence file of the sequence names of the assembled genome and the reference genome; S2, aligning second-generation Hi-C sequencing data with the assembled genome to obtain an aligned BAM file; S3, clustering the BAM file obtained in S2 to obtain the original .hic and .assembly files of the clustering result; S4, manually adjusting the original .hic and .assembly files obtained in S3 to obtain adjusted .hic and .assembly files; and S5, replacing the sequence names in the .assembly file adjusted in S4 with sequence names consistent with the reference genome according to the correspondence file obtained in S1, to obtain a name-replaced .assembly file; and/or filtering the .assembly file obtained in S4 and extracting the composition and name of the corresponding chromosome, to obtain an .assembly file for the specified chromosome name; and/or sorting the .assembly file obtained in S4 by chromosome length, to obtain a sorted .assembly file.
By applying the technical solution of the invention, the method and device for genome Hi-C analysis can modify the sequence names of a newly assembled genome after Hi-C scaffolding to match the sequence names of the reference genome; in addition, the chromosomes can be sorted by length, and a separate heat map can be drawn for a specified chromosome.
In a preferred embodiment of the present application, the method further includes S6: generating a genome fasta file from one or more of the name-replaced .assembly file, the .assembly file for the specified chromosome name, and the sorted .assembly file obtained in S5, and plotting with software to obtain the corresponding heat maps, so that the results can be displayed more intuitively.
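As a minimal sketch of how a genome fasta file might be generated from an .assembly file in S6, the following Python code concatenates the sequences of each row in order and reverse-complements the entries carrying a negative sign. The function names, the example data, and in particular the gap of N characters inserted between joined sequences (gap_size) are assumptions made for illustration; the application does not specify them, and the sketch assumes every entry of the first part corresponds to a whole input sequence.

```python
from typing import Dict, List, Tuple

# Complement table for DNA bases (upper and lower case); N stays N
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_fasta(
    contigs: Dict[str, str],
    sequences: Dict[int, Tuple[str, int]],
    scaffolds: List[List[int]],
    names: List[str],
    gap_size: int = 100,  # assumed gap length; not taken from the application
) -> str:
    """Join the sequences of each scaffold row into one fasta record."""
    gap = "N" * gap_size
    records = []
    for name, row in zip(names, scaffolds):
        parts = []
        for number in row:
            seq = contigs[sequences[abs(number)][0]]
            # A negative sequence number means the sequence is reverse-complemented
            parts.append(reverse_complement(seq) if number < 0 else seq)
        records.append(f">{name}\n{gap.join(parts)}")
    return "\n".join(records) + "\n"


# Hypothetical example: one chromosome made of ctg1 followed by reversed ctg2
contigs = {"ctg1": "AACG", "ctg2": "TTGA"}
sequences = {1: ("ctg1", 4), 2: ("ctg2", 4)}
fasta = build_fasta(contigs, sequences, [[1, -2]], ["Chr01"], gap_size=2)
```

Here the record for Chr01 is AACG, a two-base gap, and TCAA (the reverse complement of TTGA).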
Preferably, in S1 the assembled genome is aligned with the reference genome using the software MUMmer or LASTZ; and/or in S2 the second-generation Hi-C sequencing data are aligned with the assembled genome using the software HiCUP; and/or in S3 the BAM file obtained in S2 is clustered using the software ALLHiC; and/or in S4 the original .hic and .assembly files of the clustering result are manually adjusted using the software Juicebox. Of course, each of these programs may be replaced by software with similar functions.
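The application does not state how the alignments from S1 are reduced to a sequence-name correspondence. One possible approach, assumed here purely for illustration, is to assign each assembled sequence to the reference sequence with which it shares the greatest total aligned length. The input format (tuples of assembled name, reference name, aligned length) and the function name build_id_relation are hypothetical; such tuples would have to be extracted from the aligner output by a separate step.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def build_id_relation(hits: Iterable[Tuple[str, str, int]]) -> Dict[str, str]:
    """Map each assembled sequence to its best-matching reference sequence.

    hits: (assembled_name, reference_name, aligned_length) tuples,
          one per alignment block (hypothetical summary format).
    """
    totals: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for asm_name, ref_name, aligned_length in hits:
        totals[asm_name][ref_name] += aligned_length
    # Choose the reference sequence with the largest summed aligned length
    return {asm: max(refs, key=refs.get) for asm, refs in totals.items()}


# Hypothetical alignment summary
hits = [
    ("Hic_asm_0", "Chr02", 900),
    ("Hic_asm_0", "Chr01", 300),
    ("Hic_asm_0", "Chr01", 200),
    ("Hic_asm_1", "Chr01", 1500),
]
relation = build_id_relation(hits)
```

This sketch does not check whether two assembled sequences are assigned to the same reference sequence; a practical implementation would need to detect and resolve such conflicts.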
In one embodiment of the present application, to facilitate implementation of the technical solution, the inventors further wrote auxiliary programs. For example, in S5 the sequence names in the .assembly file adjusted in S4 are replaced with sequence names consistent with the reference genome using the program id-assembly.py; the .assembly file obtained in S4 is filtered, and the composition and name of the corresponding chromosome are extracted, using the program choose_ID-assembly.py; and sorting by chromosome length is performed using the program oder-assembly.py.
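The source code of these programs is not reproduced in the application. The name replacement of S5 could, under stated assumptions, be sketched as follows: the correspondence file is assumed to be a two-column, whitespace-separated text file (assembled name, reference name), and the scaffolds are assumed to be held as a mapping from scaffold name to a row of signed sequence numbers. Both the format and the function names are hypothetical and are not claimed to match id-assembly.py.

```python
from typing import Dict, List


def parse_id_relation(text: str) -> Dict[str, str]:
    """Read an assumed two-column correspondence file: assembled name, reference name."""
    relation: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        asm_name, ref_name = line.split()[:2]
        relation[asm_name] = ref_name
    return relation


def rename_scaffolds(
    scaffolds: Dict[str, List[int]], relation: Dict[str, str]
) -> Dict[str, List[int]]:
    """Replace scaffold names using the correspondence; unmatched names are kept."""
    renamed: Dict[str, List[int]] = {}
    for name, row in scaffolds.items():
        new_name = relation.get(name, name)
        if new_name in renamed:
            raise ValueError(f"duplicate target name: {new_name}")
        renamed[new_name] = row
    return renamed


# Hypothetical example data
relation = parse_id_relation("Hic_asm_0\tChr02\nHic_asm_1\tChr01\n")
scaffolds = {"Hic_asm_0": [1, -3], "Hic_asm_1": [2], "Hic_asm_2": [4]}
renamed = rename_scaffolds(scaffolds, relation)
```

Raising an error on duplicate target names is a design choice of this sketch: silently overwriting a scaffold would lose sequence from the output genome.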
According to an exemplary embodiment of the present invention, a device for genome Hi-C analysis is provided. The device comprises: a correspondence determining module configured to align the assembled genome with the reference genome to obtain a correspondence file of the sequence names of the assembled genome and the reference genome; an alignment module configured to align second-generation Hi-C sequencing data with the assembled genome to obtain an aligned BAM file; a clustering module configured to cluster the BAM file obtained in the alignment module to obtain the original .hic and .assembly files of the clustering result; an adjusting module configured to manually adjust the original .hic and .assembly files obtained in the clustering module to obtain adjusted .hic and .assembly files; and a result file output module, the result file output module comprising: a name replacement submodule configured to replace the sequence names in the .assembly file adjusted in the adjusting module with sequence names consistent with the reference genome according to the correspondence file obtained in the correspondence determining module, to obtain a name-replaced .assembly file; a filtering submodule configured to filter the .assembly file obtained in the adjusting module and extract the composition and name of the corresponding chromosome, to obtain an .assembly file for the specified chromosome name; and a sorting submodule configured to sort the .assembly file obtained in the adjusting module by chromosome length, to obtain a sorted .assembly file.
In a preferred embodiment of the present application, the device further includes a heat map generating module configured to generate a genome fasta file from one or more of the name-replaced .assembly file, the .assembly file for the specified chromosome name, and the sorted .assembly file obtained in the result file output module, and to plot with software to obtain the corresponding heat maps.
Preferably, the correspondence determining module uses the software MUMmer or LASTZ to align the assembled genome with the reference genome; and/or the alignment module uses the software HiCUP to align the second-generation Hi-C sequencing data with the assembled genome; and/or the clustering module uses the software ALLHiC to cluster the BAM file; and/or the adjusting module uses the software Juicebox to manually adjust the original .hic and .assembly files of the clustering result.
In a preferred embodiment of the present application, the name replacement submodule replaces the sequence names in the adjusted .assembly file with sequence names consistent with the reference genome using the program id-assembly.py; the filtering submodule filters the .assembly file and extracts the composition and name of the corresponding chromosome using the program choose_ID-assembly.py; and the sorting submodule sorts by chromosome length using the program oder-assembly.py.
According to an exemplary embodiment of the present invention, a computer-readable storage medium is provided, the storage medium comprising a stored program, wherein the program, when executed, controls a device on which the storage medium is located to perform the above-mentioned method for genomic hic analysis.
According to an exemplary embodiment of the present invention, a processor for executing a program is provided, wherein the program is executed to perform the above method for analyzing genome hic.
The advantageous effects of the present invention will be further described with reference to examples.
Example 1
This example describes a method for optimizing Hi-C analysis during the assembly of a plant genome. Referring to FIG. 1, the method comprises the following steps:
1. Align the target genome (genome) with the reference genome (ref-genome) using the software MUMmer to obtain a correspondence file (id-relation) of the sequence names of the assembled genome and the reference genome.
2. Align the second-generation Hi-C sequencing data with the assembled genome using the software HiCUP to obtain an aligned BAM file.
3. Cluster the BAM file from step 2 using the software ALLHiC to obtain the original .hic and .assembly files of the clustering result.
4. Manually adjust the original .hic and .assembly files from step 3 using the software Juicebox to obtain adjusted .hic and .assembly files.
5. Where chromosome names are to be synchronized, replace the sequence names in the adjusted .assembly file with sequence names consistent with those of the reference genome using the id-assembly.py program.
6. Where a specified chromosome is to be drawn, filter the .assembly file from step 4 using the choose_ID-assembly.py program and extract the composition and name of the corresponding chromosome.
7. Where chromosomes are to be sorted, sort the .assembly file from step 4 by chromosome length, from largest to smallest, using the oder-assembly.py program.
8. Generate the final genome fasta files from the .assembly files of steps 5, 6, and 7, and plot with R to obtain the corresponding heat maps.
FIG. 2A and FIG. 2B show the heat maps of the reference genome and of the name-modified genome in example 1, respectively; FIG. 3 shows the collinearity between the name-modified genome and the reference genome; FIG. 4 shows the heat map of the single chromosome Chr01; and FIG. 5 shows the heat map after the chromosomes are sorted by size.
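The filtering and sorting operations of steps 6 and 7 could be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the code of the programs named above: scaffolds are assumed to be held as a mapping from chromosome name to a row of signed sequence numbers, lengths maps each sequence number to its length, and all function names and example values are hypothetical.

```python
from typing import Dict, List


def scaffold_length(row: List[int], lengths: Dict[int, int]) -> int:
    """Sum the lengths of the sequences in one row; the sign only encodes orientation."""
    return sum(lengths[abs(number)] for number in row)


def select_chromosomes(
    scaffolds: Dict[str, List[int]], wanted: List[str]
) -> Dict[str, List[int]]:
    """Keep only the specified chromosomes (step 6), in the order requested."""
    missing = [name for name in wanted if name not in scaffolds]
    if missing:
        raise KeyError(f"unknown chromosome name(s): {missing}")
    return {name: scaffolds[name] for name in wanted}


def sort_by_length(
    scaffolds: Dict[str, List[int]], lengths: Dict[int, int]
) -> Dict[str, List[int]]:
    """Order chromosomes from largest to smallest (step 7)."""
    ordered = sorted(
        scaffolds.items(),
        key=lambda item: scaffold_length(item[1], lengths),
        reverse=True,
    )
    return dict(ordered)


# Hypothetical example data
lengths = {1: 1000, 2: 500, 3: 250, 4: 2000}
scaffolds = {"Chr01": [2], "Chr02": [1, -3], "Chr03": [4]}
single = select_chromosomes(scaffolds, ["Chr01"])
ordered_names = list(sort_by_length(scaffolds, lengths))
```

With these example values the chromosome lengths are 500, 1250, and 2000, so the sorted order is Chr03, Chr02, Chr01.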
Example 2
This example provides a device for genome Hi-C analysis. The device comprises: a correspondence determining module configured to align the assembled genome with the reference genome to obtain a correspondence file of the sequence names of the assembled genome and the reference genome; an alignment module configured to align second-generation Hi-C sequencing data with the assembled genome to obtain an aligned BAM file; a clustering module configured to cluster the BAM file obtained in the alignment module to obtain the original .hic and .assembly files of the clustering result; an adjusting module configured to manually adjust the .hic and .assembly files obtained in the clustering module to obtain adjusted .hic and .assembly files; and a result file output module, the result file output module comprising: a name replacement submodule configured to replace the sequence names in the .assembly file adjusted in the adjusting module with sequence names consistent with the reference genome according to the correspondence file obtained in the correspondence determining module, to obtain a name-replaced .assembly file; a filtering submodule configured to filter the .assembly file obtained in the adjusting module and extract the composition and name of the corresponding chromosome, to obtain an .assembly file for the specified chromosome name; and a sorting submodule configured to sort the .assembly file obtained in the adjusting module by chromosome length, to obtain a sorted .assembly file.
In a preferred embodiment of the present application, the device further includes a heat map generating module configured to generate a genome fasta file from one or more of the name-replaced .assembly file, the .assembly file for the specified chromosome name, and the sorted .assembly file obtained in the result file output module, and to plot with software to obtain the corresponding heat maps.
Preferably, the correspondence determining module uses the software MUMmer or LASTZ to align the assembled genome with the reference genome; and/or the alignment module uses the software HiCUP to align the second-generation Hi-C sequencing data with the assembled genome; and/or the clustering module uses the software ALLHiC to cluster the BAM file; and/or the adjusting module uses the software Juicebox to manually adjust the original .hic and .assembly files of the clustering result.
In a preferred embodiment of the present application, the name replacement submodule replaces the sequence names in the adjusted .assembly file with sequence names consistent with the reference genome using the program id-assembly.py; the filtering submodule filters the .assembly file and extracts the composition and name of the corresponding chromosome using the program choose_ID-assembly.py; and the sorting submodule sorts by chromosome length using the program oder-assembly.py.
Example 3
This embodiment provides a computer-readable storage medium comprising a stored program, wherein the program, when executed, controls the device on which the storage medium is located to perform any of the above methods of genome Hi-C analysis.
This embodiment also provides a processor for executing a program, wherein the program, when executed, performs any of the above methods of genome Hi-C analysis.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, for a specific example in this embodiment, reference may be made to the examples described in the above embodiment and optional implementation, and this embodiment is not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
From the above description, it can be seen that the above-described embodiments of the present invention achieve the following technical effects: the invention supports modifying sequence names according to a reference genome, can sort by chromosome length, and also supports outputting a heat map of a single chromosome for a specified chromosome name.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method of genomic hic analysis comprising the steps of:
S1, aligning an assembled genome with a reference genome to obtain a correspondence file of the sequence names of the assembled genome and the reference genome;
S2, aligning second-generation Hi-C sequencing data with the assembled genome to obtain an aligned BAM file;
S3, clustering the BAM file obtained in S2 to obtain the original .hic and .assembly files of the clustering result;
S4, manually adjusting the original .hic and .assembly files of the clustering result obtained in S3 to obtain adjusted .hic and .assembly files; and
S5, replacing the sequence names in the .assembly file adjusted in S4 with sequence names consistent with the reference genome according to the correspondence file obtained in S1, to obtain a name-replaced .assembly file; and/or
filtering the .assembly file obtained in S4 and extracting the composition and name of the corresponding chromosome, to obtain an .assembly file for the specified chromosome name; and/or
sorting the .assembly file obtained in S4 by chromosome length, to obtain a sorted .assembly file.
2. The method according to claim 1, further comprising S6, generating a genome fasta file from one or more of the assembly file after replacing the name, the assembly file specifying the chromosome name and the ordered assembly file obtained in S5, and using a software drawing to obtain the corresponding heat map pictures.
3. The method of claim 1, wherein in S1 the assembled genome and the reference genome are aligned using the software mummer or lastz; and/or
in S2, the second-generation sequencing data of hic sequencing are aligned with the assembled genome using the software hicup; and/or
in S3, the bam file obtained in S2 is clustered using the software allhic.
4. The method according to claim 1, wherein in S3 the bam file obtained in S2 is clustered using the software allhic; and/or
in S4, the original hic file and the assembly file of the clustered files are manually adjusted using the software juicebox.
5. An apparatus for genomic hic analysis, comprising:
a correspondence determination module, configured to align the assembled genome with the reference genome to obtain a correspondence file of the sequence names of the assembled genome and the reference genome;
an alignment module, configured to align the second-generation sequencing data of hic sequencing with the assembled genome to obtain an aligned bam file;
a clustering module, configured to cluster the bam file obtained in the alignment module to obtain the original hic file and the assembly file of the clustered files;
an adjustment module, configured to manually adjust the hic file and the assembly file obtained from the clustering module to obtain the adjusted assembly.hic and assembly files; and
a result file output module, the result file output module comprising:
a name replacement sub-module, configured to replace, according to the correspondence file obtained in the correspondence determination module, the sequence names in the assembly file adjusted in the adjustment module with sequence names consistent with the reference genome, to obtain a name-replaced assembly file;
a filtering sub-module, configured to filter the assembly file obtained in the adjustment module and extract the corresponding chromosome composition and names to obtain an assembly file with specified chromosome names;
and a sorting sub-module, configured to sort the assembly file obtained in the adjustment module by chromosome length to obtain a sorted assembly file.
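As an illustration only, one way a correspondence determination module could derive the name correspondence from alignment results is to assign each assembled sequence to the reference sequence with the largest total aligned length. The following Python sketch assumes this rule and a hypothetical hit layout; neither is stated in the claims, and the tuples are not the output format of any particular aligner.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical hit layout: (assembled sequence name, reference sequence name, aligned bases).
Hit = Tuple[str, str, int]


def build_correspondence(hits: Iterable[Hit]) -> Dict[str, str]:
    """Map each assembled sequence to the reference sequence it aligns to most."""
    totals: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for assembled, reference, aligned in hits:
        totals[assembled][reference] += aligned
    return {assembled: max(refs, key=refs.get) for assembled, refs in totals.items()}


def format_correspondence(mapping: Dict[str, str]) -> str:
    """Render the mapping as tab-separated lines (an assumed file layout)."""
    return "".join(f"{assembled}\t{reference}\n" for assembled, reference in sorted(mapping.items()))


# Made-up example data.
hits = [
    ("group1", "chr2", 800),
    ("group1", "chr1", 300),
    ("group1", "chr2", 200),
    ("group2", "chr1", 900),
]
mapping = build_correspondence(hits)
print(format_correspondence(mapping), end="")
```

Such a mapping is the kind of input the name replacement sub-module would consume.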
6. The apparatus according to claim 5, further comprising a heat map generation module configured to generate a genome fasta file from one or more of the name-replaced assembly file, the assembly file with specified chromosome names, and the sorted assembly file obtained in the result file output module, and to plot with software to obtain the corresponding heat map.
7. The apparatus of claim 5, wherein the correspondence determination module uses the software mummer or lastz to align the assembled genome with the reference genome; and/or
the alignment module uses the software hicup to align the second-generation sequencing data of hic sequencing with the assembled genome.
8. The apparatus of claim 5, wherein the clustering module uses the software allhic to cluster the bam file; and/or
the adjustment module uses the software juicebox to manually adjust the original hic file and the assembly file of the clustered files.
9. A computer readable storage medium, comprising a stored program, wherein the program, when executed, controls a device on which the storage medium is located to perform the method of genomic hic analysis according to any one of claims 1 to 4.
10. A processor configured to run a program, wherein the program when executed performs the method of genomic hic analysis of any of claims 1 to 4.
CN202211561394.9A 2022-12-07 2022-12-07 Method and device for analyzing genome hic Active CN115579061B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211561394.9A CN115579061B (en) 2022-12-07 2022-12-07 Method and device for analyzing genome hic

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211561394.9A CN115579061B (en) 2022-12-07 2022-12-07 Method and device for analyzing genome hic

Publications (2)

Publication Number Publication Date
CN115579061A (en) 2023-01-06
CN115579061B (en) 2023-04-07

Family

ID=84590780

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211561394.9A Active CN115579061B (en) 2022-12-07 2022-12-07 Method and device for analyzing genome hic

Country Status (1)

Country Link
CN (1) CN115579061B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109326323A (en) * 2018-09-13 2019-02-12 北京百迈客生物科技有限公司 A kind of assemble method and device of genome
US20190279740A1 (en) * 2018-01-14 2019-09-12 The Broad Institute, Inc. Linear genome assembly from three dimensional genome structure
CN112908415A (en) * 2021-02-23 2021-06-04 广西壮族自治区农业科学院 Method for obtaining more accurate chromosome level genome
WO2021119550A1 (en) * 2019-12-13 2021-06-17 The Broad Institute, Inc. Method for determination of 3d genome architecture with base pair resolution and further uses thereof
CN113782101A (en) * 2021-11-12 2021-12-10 北京诺禾致源科技股份有限公司 Method and device for removing redundancy of high heterozygous diploid sequence assembly result and application of method and device
CN113808668A (en) * 2021-11-18 2021-12-17 北京诺禾致源科技股份有限公司 Method and device for improving genome assembly integrity and application thereof
CN114464260A (en) * 2021-12-29 2022-05-10 天津诺禾致源生物信息科技有限公司 Assembling method and assembling device for genome at chromosome level
CN114464261A (en) * 2022-04-12 2022-05-10 天津诺禾致源生物信息科技有限公司 Method and apparatus for assembling elongated sex chromosomes
CN114566212A (en) * 2022-04-29 2022-05-31 天津诺禾致源生物信息科技有限公司 Method and device for carrying Hi-C genome larger than 10G

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Tao Jingfen et al.: "Genome assembly methods based on chromatin interaction data", Biotechnology Bulletin *

Also Published As

Publication number Publication date
CN115579061B (en) 2023-04-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant