CN115579061A

CN115579061A - Method and device for analyzing genome hic

Info

Publication number: CN115579061A
Application number: CN202211561394.9A
Authority: CN
Inventors: 王龙; 赵勇; 周勋; 彭珍; 曹斌斌; 王静; 陶琳娜; 李萍; 马策
Original assignee: Beijing Novogene Technology Co ltd
Current assignee: Beijing Novogene Technology Co ltd
Priority date: 2022-12-07
Filing date: 2022-12-07
Publication date: 2023-01-06
Anticipated expiration: 2042-12-07
Also published as: CN115579061B

Abstract

The invention discloses a method and a device for genome hic analysis. The method comprises the following steps: S1, aligning the assembled genome with a reference genome; S2, aligning the second-generation data of hic sequencing with the assembled genome; S3, clustering the bam file; S4, manually adjusting the original hic file and assembly file of the clustered file; and S5, replacing the sequence names in the assembly file adjusted in S4 with sequence names consistent with the reference genome; and/or filtering the assembly file and extracting the corresponding chromosome composition and names; and/or sorting by chromosome length. By applying the technical scheme of the invention, the sequence names of a newly assembled genome after hic scaffolding can be modified to match the sequence names of the reference genome; the chromosomes can be sorted by length, and a separate heat map can be drawn for a designated chromosome.

Description

Method and device for analyzing genome hic

Technical Field

The invention relates to the field of genome assembly, in particular to a method and a device for analyzing genome hic.

Background

Genome assembly is generally divided into second-generation sequencing data assembly and third-generation sequencing data assembly. The assembly software commonly used for second-generation sequencing data is SOAPdenovo, which combines small-fragment and large-fragment data to produce a scaffold-level genome; the assembly software commonly used for third-generation sequencing data is canu or falcon, which produces a contig-level genome. However, neither of these two assembly approaches can assemble a genome at the chromosome level.

Hi-C (High-throughput chromosome conformation capture) is a high-throughput chromosome conformation capture technology. It exploits the principle that the interaction strength within a chromosome is far greater than the interaction strength between chromosomes. Tissue is cross-linked and fixed with formaldehyde, the genome is digested with a specific restriction enzyme, the ends are biotin-labeled and repaired, the fragments are ligated and then fragmented again, and the biotin-labeled fragments are captured with magnetic beads for high-throughput sequencing. The sequencing data are combined with the contig- or scaffold-level genome for scaffolding with allhic or 3d-dna software, the generated hic file and assembly file are manually adjusted with juicebox, and a chromosome-level genome is finally obtained.

The hic file is a binary file generated by the 3d-dna software; its content is the hic interaction matrix, and the default resolutions of the hic file generated by 3d-dna are "2500000, 1000000, 500000, 250000, 100000, 50000, 25000, 10000, 5000, 1000". The assembly file is a text file corresponding to the hic file and records the length and position information of each contig or scaffold. The assembly file is divided into two parts. The first part consists of lines beginning with ">" and has 3 columns in total: the first column is the sequence name, prefixed with ">"; the second column is the sequence number, starting from 1; the third column is the sequence length. The second part consists of purely numeric lines with a variable number of columns; each line represents several contigs or scaffolds clustered into one large scaffold, the numbers in the line correspond to the sequence numbers of the first part, and a negative sign indicates that the sequence is reverse-complemented.
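The two-part layout described above can be read with a few lines of Python. The following is a minimal sketch that assumes whitespace-separated columns exactly as described; the function name parse_assembly and the sample contig names are hypothetical and are not taken from any particular tool.

```python
from typing import Dict, List, Tuple


def parse_assembly(text: str) -> Tuple[Dict[int, Tuple[str, int]], List[List[int]]]:
    """Split the text of an assembly file into its two parts.

    Returns (contigs, scaffolds):
      contigs   maps the 1-based sequence number to (name, length);
      scaffolds holds one list of signed sequence numbers per numeric line.
    """
    contigs: Dict[int, Tuple[str, int]] = {}
    scaffolds: List[List[int]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # First part: ">name  number  length"
            name, number, length = line[1:].split()
            contigs[int(number)] = (name, int(length))
        else:
            # Second part: signed sequence numbers; "-" means reverse complement
            scaffolds.append([int(token) for token in line.split()])
    return contigs, scaffolds


# Hypothetical three-contig example: contigs 1 and 3 (reversed) form one scaffold.
example = """>ctg_a 1 1000
>ctg_b 2 500
>ctg_c 3 2000
1 -3
2
"""
contigs, scaffolds = parse_assembly(example)
```

Keeping the sequence numbers as signed integers preserves the orientation information, so later steps can decide whether to reverse-complement a contig.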

Juicebox is software written in Java for visualizing and processing hic data. After a hic file is imported into juicebox, parts of the hic heat map can be broken, flipped and moved according to the interaction strength; the adjusted hic file yields a correspondingly adjusted assembly file, from which the scaffolded chromosome-level genome is obtained.

A genome is assembled to the contig or scaffold level by second-generation or third-generation sequencing technology; after hic scaffolding, the hic file needs to be manually adjusted with juicebox software to correct misjoins and obtain a correctly joined chromosome-level genome. After hic scaffolding, in the genome file output by the software, chromosome names begin with Hic_asm instead of Chr, the chromosome names are arbitrary, and the chromosomes are not sorted by length; the generated chromosome interaction heat map also contains all chromosomes, and no separate heat map is generated for each chromosome. When a reference genome is available and the sequence names of the newly assembled genome are to match those of the reference genome, manual modification is required, which is tedious and error-prone.

Disclosure of Invention

The invention aims to provide a method and a device for analyzing genome hic, so as to optimize the method for analyzing the genome hic in the prior art.

To achieve the above object, according to one aspect of the present invention, there is provided a method of genomic hic analysis. The method comprises the following steps: S1, aligning the assembled genome with a reference genome to obtain a correspondence file of the sequence names of the assembled genome and the reference genome; S2, aligning the second-generation data of hic sequencing with the assembled genome to obtain an aligned bam file; S3, clustering the bam file obtained in S2 to obtain the original hic file and assembly file of the clustered file; S4, manually adjusting the original hic file and assembly file obtained in S3 to obtain the adjusted assembly.hic and assembly files; and S5, replacing the sequence names in the assembly file adjusted in S4 with sequence names consistent with the reference genome according to the correspondence file obtained in S1 to obtain a name-replaced assembly file; and/or filtering the assembly file obtained in S4 and extracting the corresponding chromosome composition and names to obtain an assembly file of the designated chromosome names; and/or sorting the assembly file obtained in S4 by chromosome length to obtain a sorted assembly file.
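The text does not specify how the correspondence file of S1 is derived from the alignment. One plausible approach, shown here purely as an illustrative assumption, is to assign each assembled sequence to the reference sequence with which it shares the greatest total aligned length. The input tuples, the function name best_reference_per_sequence and the sequence names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def best_reference_per_sequence(
    hits: Iterable[Tuple[str, str, int]]
) -> Dict[str, str]:
    """Map each assembled sequence to its best-matching reference sequence.

    hits yields (assembled_name, reference_name, aligned_length) records,
    for example parsed from the output of a whole-genome aligner.
    """
    totals: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for assembled, reference, aligned_length in hits:
        totals[assembled][reference] += aligned_length

    relation: Dict[str, str] = {}
    for assembled, per_reference in totals.items():
        # Iterate references in sorted order so that ties resolve deterministically.
        relation[assembled] = max(sorted(per_reference), key=lambda r: per_reference[r])
    return relation


# Hypothetical alignment summary for two assembled chromosomes.
hits = [
    ("Hic_asm_1", "Chr02", 900),
    ("Hic_asm_1", "Chr01", 100),
    ("Hic_asm_1", "Chr01", 500),
    ("Hic_asm_2", "Chr01", 700),
]
relation = best_reference_per_sequence(hits)
```

Summing aligned lengths rather than counting hits makes the assignment less sensitive to many short spurious alignments; other criteria, such as identity-weighted length, would fit the same structure.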

Further, the method further comprises S6: generating a genome fasta file from one or more of the name-replaced assembly file, the assembly file of the designated chromosome names and the sorted assembly file obtained in S5, and plotting with software to obtain the respective heat map pictures.

Further, the assembled genome and the reference genome are aligned in S1 using the software mummer or lastz; and/or the second-generation data of hic sequencing are aligned with the assembled genome in S2 using the software hicup; and/or the bam file obtained in S2 is clustered in S3 using the software allhic.

Further, in S3, the bam file obtained in S2 is clustered using the software allhic; and/or in S4, the original hic file and assembly file of the clustered file are manually adjusted using the software juicebox.

According to another aspect of the present invention, there is provided an apparatus for genomic hic analysis. The device comprises: a correspondence determining module configured to align the assembled genome with the reference genome to obtain a correspondence file of the sequence names of the assembled genome and the reference genome; an alignment module configured to align the second-generation data of hic sequencing with the assembled genome to obtain an aligned bam file; a clustering module configured to cluster the bam file obtained in the alignment module to obtain the original hic file and assembly file of the clustered file; an adjusting module configured to manually adjust the original hic file and assembly file obtained in the clustering module to obtain the adjusted assembly.hic and assembly files; and a result file output module, the result file output module comprising: a name replacement sub-module configured to replace the sequence names in the assembly file adjusted in the adjusting module with sequence names consistent with the reference genome according to the correspondence file obtained in the correspondence determining module, to obtain a name-replaced assembly file; a filtering sub-module configured to filter the assembly file obtained in the adjusting module and extract the corresponding chromosome composition and names to obtain an assembly file of the designated chromosome names; and a sorting sub-module configured to sort the assembly file obtained in the adjusting module by chromosome length to obtain a sorted assembly file.

Further, the device further comprises a heat map generation module configured to generate a genome fasta file from one or more of the name-replaced assembly file, the assembly file of the designated chromosome names and the sorted assembly file obtained in the result file output module, and to plot with software to obtain the respective heat map pictures.

Further, the correspondence determining module uses the software mummer or lastz to align the assembled genome with the reference genome; and/or the alignment module uses the software hicup to align the second-generation data of hic sequencing with the assembled genome.

Further, the clustering module uses the software allhic to cluster the bam file; and/or the adjusting module uses the software juicebox to manually adjust the original hic file and assembly file of the clustered file.

According to yet another aspect of the present invention, a computer readable storage medium is provided. The storage medium comprises a stored program, wherein, when the program is executed, a device on which the storage medium is located is controlled to perform any of the above-described methods of genomic hic analysis.

According to another aspect of the invention, there is provided a processor for running a program, wherein the program when running performs the method of genomic hic analysis of any one of the above.

By applying the technical scheme of the invention, the sequence names of a newly assembled genome after hic scaffolding can be modified to match the sequence names of the reference genome; in addition, the chromosomes can be sorted by length, and a separate heat map can be drawn for a designated chromosome.

Drawings

The accompanying drawings, which constitute a part of this application, are included to provide a further understanding of the invention. The illustrated embodiments, together with the description, serve to explain the invention and not to limit it. In the drawings:

FIG. 1 shows a flowchart of a method of genomic hic analysis according to an embodiment of the invention;

FIGS. 2A and 2B show heat maps of the reference genome and of the genome with modified names in example 1, respectively;

FIG. 3 shows the co-linearity of the reference genome with the modified genome in example 1;

FIG. 4 shows a single chromosome Chr01 heatmap in example 1; and

FIG. 5 shows a heatmap of the chromosomes in example 1 sorted by size.

Detailed Description

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.

In order to optimize the methods for genomic hic analysis in the prior art, the following technical solutions are proposed in the present application.

According to an exemplary embodiment of the present invention, a method for genomic hic analysis is provided. The method comprises the following steps: S1, aligning the assembled genome with a reference genome to obtain a correspondence file of the sequence names of the assembled genome and the reference genome; S2, aligning the second-generation data of hic sequencing with the assembled genome to obtain an aligned bam file; S3, clustering the bam file obtained in S2 to obtain the original hic file and assembly file of the clustered file; S4, manually adjusting the original hic file and assembly file obtained in S3 to obtain the adjusted assembly.hic and assembly files; and S5, replacing the sequence names in the assembly file adjusted in S4 with sequence names consistent with the reference genome according to the correspondence file obtained in S1 to obtain a name-replaced assembly file; and/or filtering the assembly file obtained in S4 and extracting the corresponding chromosome composition and names to obtain an assembly file of the designated chromosome names; and/or sorting the assembly file obtained in S4 by chromosome length to obtain a sorted assembly file.
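The sorting option of S5 can be illustrated with a short sketch. It assumes the assembly file has already been read into a table of contig lengths keyed by sequence number and a list of rows of signed sequence numbers, and it approximates the chromosome length as the sum of its contig lengths, ignoring any gap padding that a fasta writer may insert. The function names are hypothetical; this is not the inventors' program.

```python
from typing import Dict, List


def scaffold_length(row: List[int], contig_lengths: Dict[int, int]) -> int:
    """Sum the lengths of the contigs in one row; the sign only encodes orientation."""
    return sum(contig_lengths[abs(number)] for number in row)


def sort_scaffolds_by_length(
    scaffolds: List[List[int]], contig_lengths: Dict[int, int]
) -> List[List[int]]:
    """Return the rows ordered from the longest chromosome to the shortest."""
    return sorted(
        scaffolds,
        key=lambda row: scaffold_length(row, contig_lengths),
        reverse=True,
    )


# Hypothetical data: three contigs grouped into two scaffolds.
contig_lengths = {1: 1000, 2: 500, 3: 2000}
scaffolds = [[2], [1, -3]]
ordered = sort_scaffolds_by_length(scaffolds, contig_lengths)
```

Because Python's sort is stable, rows of equal length keep their original relative order, which keeps the output reproducible.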

In a preferred embodiment of the present application, the method further includes S6: generating a genome fasta file from one or more of the name-replaced assembly file, the assembly file of the designated chromosome names and the sorted assembly file obtained in S5, and plotting with software to obtain the respective heat map pictures, so that the results can be displayed more intuitively.

Preferably, the assembled genome is aligned with the reference genome in S1 using the software mummer or lastz; and/or the second-generation data of hic sequencing are aligned with the assembled genome in S2 using the software hicup; and/or the bam file obtained in S2 is clustered in S3 using the software allhic; and/or the original hic file and assembly file of the clustered file are manually adjusted in S4 using the software juicebox. Of course, each of these programs may be replaced by other software with equivalent functions.

In one embodiment of the present application, to facilitate implementation of the technical solution, the inventors further wrote auxiliary programs. For example, in S5, the program id-assembly.py is used to replace the sequence names in the assembly file adjusted in S4 with sequence names consistent with the reference genome; the program chorose_ID-assembly.py is used to filter the assembly file obtained in S4 and extract the corresponding chromosome composition and names; and the program oder-assembly.py is used to sort by chromosome length.
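The source of the inventors' programs is not given, so the following is only a hedged sketch of what the name replacement and filtering operations might look like. It assumes each row of the assembly file's second part is one chromosome, that unnamed rows receive a default name built from a hypothetical prefix and a 1-based row number, and that the correspondence file has been loaded as a dictionary. The function names name_scaffolds and select_chromosomes are illustrative.

```python
from typing import Dict, Iterable, List


def name_scaffolds(
    scaffolds: List[List[int]],
    relation: Dict[str, str],
    prefix: str = "Hic_asm_",  # assumed default naming scheme
) -> Dict[str, List[int]]:
    """Attach reference-consistent names to the rows of an assembly file."""
    named: Dict[str, List[int]] = {}
    for number, row in enumerate(scaffolds, start=1):
        default_name = f"{prefix}{number}"
        # Fall back to the default name when no correspondence is known.
        new_name = relation.get(default_name, default_name)
        if new_name in named:
            raise ValueError(f"duplicate chromosome name: {new_name}")
        named[new_name] = row
    return named


def select_chromosomes(
    named: Dict[str, List[int]], wanted: Iterable[str]
) -> Dict[str, List[int]]:
    """Keep only the designated chromosomes, in the order they were requested."""
    return {name: named[name] for name in wanted if name in named}


# Hypothetical rows and correspondence table.
scaffolds = [[1, -3], [2]]
relation = {"Hic_asm_1": "Chr02", "Hic_asm_2": "Chr01"}
named = name_scaffolds(scaffolds, relation)
only_chr01 = select_chromosomes(named, ["Chr01"])
```

Raising an error on a duplicate name guards against two assembled chromosomes being mapped to the same reference chromosome, a case that would otherwise silently drop one of them.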

According to an exemplary embodiment of the present invention, an apparatus for genomic hic analysis is provided. The device comprises: a correspondence determining module configured to align the assembled genome with the reference genome to obtain a correspondence file of the sequence names of the assembled genome and the reference genome; an alignment module configured to align the second-generation data of hic sequencing with the assembled genome to obtain an aligned bam file; a clustering module configured to cluster the bam file obtained in the alignment module to obtain the original hic file and assembly file of the clustered file; an adjusting module configured to manually adjust the original hic file and assembly file obtained in the clustering module to obtain the adjusted assembly.hic and assembly files; and a result file output module, the result file output module comprising: a name replacement sub-module configured to replace the sequence names in the assembly file adjusted in the adjusting module with sequence names consistent with the reference genome according to the correspondence file obtained in the correspondence determining module, to obtain a name-replaced assembly file; a filtering sub-module configured to filter the assembly file obtained in the adjusting module and extract the corresponding chromosome composition and names to obtain an assembly file of the designated chromosome names; and a sorting sub-module configured to sort the assembly file obtained in the adjusting module by chromosome length to obtain a sorted assembly file.

In a preferred embodiment of the present application, the apparatus further includes a heat map generating module configured to generate a genome fasta file from one or more of the name-replaced assembly file, the assembly file of the designated chromosome names and the sorted assembly file obtained in the result file output module, and to plot with software to obtain the respective heat map pictures.

Preferably, the correspondence determining module uses the software mummer or lastz to align the assembled genome with the reference genome; and/or the alignment module uses the software hicup to align the second-generation data of hic sequencing with the assembled genome; and/or the clustering module uses the software allhic to cluster the bam file; and/or the adjusting module uses the software juicebox to manually adjust the original hic file and assembly file of the clustered file.

In a preferred embodiment of the present application, the name replacement sub-module uses the program id-assembly.py to replace the sequence names in the adjusted assembly file with sequence names consistent with the reference genome; the filtering sub-module uses the program chorose_ID-assembly.py to filter the assembly file and extract the corresponding chromosome composition and names; and the sorting sub-module uses the program oder-assembly.py to sort by chromosome length.

According to an exemplary embodiment of the present invention, a computer-readable storage medium is provided, the storage medium comprising a stored program, wherein the program, when executed, controls a device on which the storage medium is located to perform the above-mentioned method for genomic hic analysis.

According to an exemplary embodiment of the present invention, a processor for executing a program is provided, wherein the program is executed to perform the above method for analyzing genome hic.

The advantageous effects of the present invention will be further described with reference to examples.

Example 1

This example discloses a method for optimizing hic analysis in the process of assembling a plant genome. Referring to FIG. 1, the method comprises the following steps:

1. Align the assembled genome (genome) with the reference genome (ref-genome) using the software mummer to obtain a correspondence file (id-relation) of the sequence names of the assembled genome and the reference genome.

2. Align the second-generation data of hic sequencing with the assembled genome using the software hicup to obtain an aligned bam file.

3. Cluster the bam file from step 2 using the software allhic to obtain the original hic file and assembly file of the clustered file.

4. Manually adjust the original hic file and assembly file from step 3 using the software juicebox to obtain the adjusted assembly.hic and assembly files.

5. For the case of synchronizing chromosome sequence names, replace the sequence names in the adjusted assembly file with sequence names consistent with those of the reference genome using id-assembly.py.

6. For the case of drawing designated chromosomes, filter the assembly file from step 4 and extract the corresponding chromosome composition and names using chorose_ID-assembly.py.

7. For the case of chromosome sorting, sort the assembly file from step 4 by chromosome length from largest to smallest using oder-assembly.py.

8. Generate the final genome fasta files from the assembly files of steps 5, 6 and 7, and plot with R software to obtain the respective heat map pictures.
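The fasta generation in the last step can be sketched as follows. This is an illustrative assumption rather than the inventors' code: contigs with a negative sequence number are reverse-complemented, consecutive contigs are joined with a run of N characters whose length (gap_length) is an arbitrary illustrative parameter, and all function names are hypothetical.

```python
from typing import Dict, List

# Translation table for complementing DNA while preserving case and N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def build_chromosome(
    row: List[int], sequences: Dict[int, str], gap_length: int = 100
) -> str:
    """Concatenate the contigs of one row, honoring orientation signs."""
    parts = []
    for number in row:
        contig = sequences[abs(number)]
        parts.append(reverse_complement(contig) if number < 0 else contig)
    return ("N" * gap_length).join(parts)


def to_fasta(
    named_rows: Dict[str, List[int]],
    sequences: Dict[int, str],
    gap_length: int = 100,
    width: int = 60,
) -> str:
    """Render named chromosomes as fasta text with fixed-width sequence lines."""
    lines = []
    for name, row in named_rows.items():
        chromosome = build_chromosome(row, sequences, gap_length)
        lines.append(f">{name}")
        lines.extend(
            chromosome[start:start + width]
            for start in range(0, len(chromosome), width)
        )
    return "\n".join(lines) + "\n"


# Hypothetical two-contig chromosome; the second contig is reversed.
sequences = {1: "AAC", 2: "GGT"}
fasta_text = to_fasta({"Chr01": [1, -2]}, sequences, gap_length=2)
```

The same function can be applied to the name-replaced, filtered or sorted rows, since each of them is just a mapping from chromosome name to a list of signed sequence numbers.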

FIGS. 2A and 2B show heat maps of the reference genome and of the genome with modified names in example 1, respectively; FIG. 3 shows the collinearity between the genome with modified names and the reference genome; FIG. 4 shows the heat map of the single chromosome Chr01; and FIG. 5 shows the heat map after the chromosomes are sorted by size.

Example 2

This example provides an apparatus for genomic hic analysis. The apparatus comprises: a correspondence determining module configured to align the assembled genome with the reference genome to obtain a correspondence file of the sequence names of the assembled genome and the reference genome; an alignment module configured to align the second-generation data of hic sequencing with the assembled genome to obtain an aligned bam file; a clustering module configured to cluster the bam file obtained in the alignment module to obtain the original hic file and assembly file of the clustered file; an adjusting module configured to manually adjust the hic file and assembly file obtained in the clustering module to obtain the adjusted assembly.hic and assembly files; and a result file output module, the result file output module comprising: a name replacement sub-module configured to replace the sequence names in the assembly file adjusted in the adjusting module with sequence names consistent with the reference genome according to the correspondence file obtained in the correspondence determining module, to obtain a name-replaced assembly file; a filtering sub-module configured to filter the assembly file obtained in the adjusting module and extract the corresponding chromosome composition and names to obtain an assembly file of the designated chromosome names; and a sorting sub-module configured to sort the assembly file obtained in the adjusting module by chromosome length to obtain a sorted assembly file.

In a preferred embodiment of the present application, the apparatus further includes a heat map generating module configured to generate a genome fasta file from one or more of the name-replaced assembly file, the assembly file of the designated chromosome names and the sorted assembly file obtained in the result file output module, and to plot with software to obtain the respective heat map pictures.

In a preferred embodiment of the present application, the name replacement sub-module uses the program id-assembly.py to replace the sequence names in the adjusted assembly file with sequence names consistent with the reference genome; the filtering sub-module uses the program chorose_ID-assembly.py to filter the assembly file and extract the corresponding chromosome composition and names; and the sorting sub-module uses the program oder-assembly.py to sort by chromosome length.

Example 3

The present embodiments provide a computer readable storage medium comprising a stored program, wherein the program when executed controls a device on which the storage medium is located to perform any of the methods of genomic hic analysis.

The present embodiment also provides a processor for executing a program, wherein the program, when run, performs any of the above-described methods of genomic hic analysis.

Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.

Optionally, for a specific example in this embodiment, reference may be made to the examples described in the above embodiment and optional implementation, and this embodiment is not described herein again.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

From the above description, it can be seen that the above-described embodiments of the present invention achieve the following technical effects: the invention supports modifying sequence names according to a reference genome, can sort by chromosome length, and also supports outputting a heat map of a single chromosome with a designated chromosome name.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method of genomic hic analysis comprising the steps of:

S1, aligning an assembled genome with a reference genome to obtain a correspondence file of the sequence names of the assembled genome and the reference genome;

S2, aligning the second-generation data of hic sequencing with the assembled genome to obtain an aligned bam file;

S3, clustering the bam file obtained in S2 to obtain the original hic file and assembly file of the clustered file;

S4, manually adjusting the original hic file and assembly file of the clustered file obtained in S3 to obtain the adjusted assembly.hic and assembly files; and

S5, replacing the sequence names in the assembly file adjusted in S4 with sequence names consistent with the reference genome according to the correspondence file obtained in S1 to obtain a name-replaced assembly file; and/or

filtering the assembly file obtained in S4 and extracting the corresponding chromosome composition and names to obtain an assembly file of the designated chromosome names; and/or

sorting the assembly file obtained in S4 by chromosome length to obtain a sorted assembly file.

2. The method according to claim 1, further comprising S6: generating a genome fasta file from one or more of the name-replaced assembly file, the assembly file of the designated chromosome names and the sorted assembly file obtained in S5, and plotting with software to obtain the respective heat map pictures.

3. The method of claim 1, wherein the assembled genome and the reference genome are aligned in S1 using the software mummer or lastz; and/or

the second-generation data of hic sequencing are aligned with the assembled genome in S2 using the software hicup; and/or

the bam file obtained in S2 is clustered in S3 using the software allhic.

4. The method according to claim 1, wherein in S3, the bam file obtained in S2 is clustered using the software allhic; and/or

in S4, the original hic file and assembly file of the clustered file are manually adjusted using the software juicebox.

5. An apparatus for genomic hic analysis, comprising:

the corresponding relation determining module is configured to compare the assembled genome with the reference genome to obtain a corresponding relation file of sequence names of the assembled genome and the reference genome;

the alignment module is configured to align the second generation data of hic sequencing with the assembled genome to obtain an aligned bam file;

the clustering module is configured to cluster the bam files obtained in the comparison module to obtain original hic files and assembly files of the clustered files;

the adjusting module is configured to manually adjust the hic file and the assembly file which are obtained from the clustering module to obtain the adjusted assembly.hic and assembly files; and

a result file output module, the result file output module comprising:

a name replacement sub-module, configured to replace, according to the correspondence file obtained in the correspondence determination module, the sequence name in the assembly file adjusted in the adjustment module with a sequence name that is consistent with the reference genome, so as to obtain an assembly file with the replaced name;

the filtering sub-module is configured to filter the assembly file obtained in the adjusting module and extract the corresponding chromosome composition and name to obtain an assembly file of the specified chromosome name;

and the sorting sub-module is configured to sort the assembly file obtained in the adjusting module by chromosome length to obtain a sorted assembly file.

6. The apparatus according to claim 5, further comprising a heat map generating module configured to generate a genome fasta file from one or more of the name-replaced assembly file, the assembly file of the designated chromosome names and the sorted assembly file obtained in the result file output module, and to plot with software to obtain the respective heat map pictures.

7. The apparatus of claim 5, wherein the correspondence determining module uses the software mummer or lastz to align the assembled genome with the reference genome; and/or

the alignment module uses the software hicup to align the second-generation data of hic sequencing with the assembled genome.

8. The apparatus of claim 5, wherein the clustering module uses a software allhic to cluster the bam files; and/or

the adjusting module uses the software juicebox to manually adjust the original hic file and assembly file of the clustered file.

9. A computer readable storage medium, comprising a stored program, wherein the program, when executed, controls a device on which the storage medium is located to perform the method of genomic hic analysis according to any one of claims 1 to 4.

10. A processor configured to run a program, wherein the program when executed performs the method of genomic hic analysis of any of claims 1 to 4.