CN109326323B - Genome assembly method and device - Google Patents

Publication number
CN109326323B
Authority
CN
China
Prior art keywords
genome
chromosome
assembly
level
corrected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811068666.5A
Other languages
Chinese (zh)
Other versions
CN109326323A (en
Inventor
郑洪坤
刘福
刘敏
李绪明
邓德晶
刘东源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Biomarker Technologies Co ltd
Original Assignee
Beijing Biomarker Technologies Co ltd

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Embodiments of the invention provide a genome assembly method and a genome assembly device. The method comprises: acquiring error-corrected gene information, wherein the gene information comprises a genome and alignment information; performing genome assembly according to the genome and the alignment information; and, if the assembly result is determined not to meet a preset condition, performing regional clustering on the assembly result and reassembling the result of that processing. The device performs the above method. Because assembly results that do not meet the preset condition undergo regional clustering and are then reassembled, the method and device improve the generality and accuracy of genome assembly.

Description

Genome assembly method and device
Technical Field
The embodiment of the invention relates to the technical field of genetic engineering, in particular to a genome assembly method and a genome assembly device.
Background
With the development of genetic technology, genome assembly has become increasingly important.
Existing genome assembly methods generally rely on one of three software packages: LACHESIS, SALSA and 3D-DNA. LACHESIS was the first software for Hi-C-assisted genome assembly. It is relatively general and can assemble to the chromosome level, but its results depend heavily on the quality of the input assembly. For genomes assembled from current third-generation sequencing data in particular, it tends to fuse chromosomes when partitioning chromosome groups, so assembly accuracy is poor. The software also lacks a function for correcting sequence errors and a function for evaluating assembly accuracy with a whole-genome bin heat map, and it performs especially poorly on plant genomes. SALSA performs Hi-C-assisted assembly for third-generation assembled genomes, but it only improves Scaffold-level metrics and cannot produce a chromosome-level assembly. 3D-DNA corrects errors first and then assembles the genome, but it has three drawbacks: first, its parameters are complex, and because it considers single sequences rather than the whole genome it is prone to over-correction; second, it cannot partition sequences into the true number of chromosomes, so many fused chromosomes remain; third, it performs poorly on plant and fungal genomes, so its generality is limited.
Therefore, how to avoid the above-mentioned drawbacks and improve the versatility and accuracy of genome assembly is a problem that needs to be solved urgently.
Disclosure of Invention
In order to solve the problems in the prior art, embodiments of the present invention provide a method and an apparatus for assembling a genome.
In a first aspect, embodiments of the present invention provide a method for assembling a genome, the method comprising:
acquiring error-corrected gene information, wherein the gene information comprises a genome and alignment information;
performing genome assembly according to the genome and the alignment information;
and, if the assembly result is determined not to meet a preset condition, performing regional clustering on the assembly result and reassembling the result of that processing.
In a second aspect, the embodiments of the present invention provide an apparatus for assembling a genome, the apparatus including:
the acquisition unit is configured to acquire error-corrected gene information, wherein the gene information comprises a genome and alignment information;
the assembly unit is configured to perform genome assembly according to the genome and the alignment information;
and the reassembly unit is configured to perform regional clustering on the assembly result and reassemble the result of that processing if the assembly result is determined not to meet a preset condition.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a processor, a memory, and a bus, wherein,
the processor and the memory are communicated with each other through the bus;
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform a method comprising:
acquiring error-corrected gene information, wherein the gene information comprises a genome and alignment information;
performing genome assembly according to the genome and the alignment information;
and, if the assembly result is determined not to meet a preset condition, performing regional clustering on the assembly result and reassembling the result of that processing.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, including:
the non-transitory computer readable storage medium stores computer instructions that cause the computer to perform a method comprising:
acquiring error-corrected gene information, wherein the gene information comprises a genome and alignment information;
performing genome assembly according to the genome and the alignment information;
and, if the assembly result is determined not to meet a preset condition, performing regional clustering on the assembly result and reassembling the result of that processing.
According to the genome assembly method and device provided by the embodiments of the invention, assembly results that do not meet the preset condition undergo regional clustering and the processing results are reassembled, which improves the generality and accuracy of genome assembly.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow diagram of a genome assembly method according to an embodiment of the present invention;
FIG. 2 is a heat map of the Hi-C sequencing Read distribution on a Scaffold/Contig according to an embodiment of the present invention;
FIG. 3 is a heat map of genome assembly at the chromosome level according to an embodiment of the present invention;
FIG. 4 is a heat map of the partitioned assembly of a single chromosome according to an embodiment of the present invention;
FIG. 5 is a heat map of the result of reassembling the blocks of a single chromosome according to an embodiment of the present invention;
FIG. 6 is a heat map of the single-chromosome diploid assembly of a highly heterozygous diploid genome according to an embodiment of the present invention;
FIG. 7 is a block move diagram according to an embodiment of the present invention;
FIG. 8 is a block inversion diagram according to an embodiment of the present invention;
FIG. 9 is a telomere adjustment diagram according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of a genome assembly apparatus according to an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
FIG. 1 is a schematic flow diagram of a genome assembly method according to an embodiment of the present invention. As shown in FIG. 1, the genome assembly method provided by the embodiment of the present invention includes the following steps:
S101: Acquire error-corrected gene information, wherein the gene information comprises a genome and alignment information.
Specifically, the device obtains the error-corrected gene information, which includes a genome and alignment information. The device is to be understood as the apparatus that carries out the method and is not specifically limited. The procedure may be as follows: cut the Scaffold- and/or Contig-level genome into a bin-level genome; convert the Scaffold- and/or Contig-level Hi-C alignment information file into a bin-level alignment file; perform an initial genome assembly from the bin-level genome and the alignment file; identify bins that cannot be assembled back to their corresponding positions in the original Scaffold and/or Contig sequences; and break the sequence at the point near each such bin where Hi-C sequencing Read coverage is lowest, which yields the gene information. Sequence breakpoints and their accuracy can be further confirmed from the heat map of the Hi-C Read distribution on the original Scaffold/Contig sequence. FIG. 2 is such a heat map according to an embodiment of the present invention, in which MQ denotes the quality value of a Read alignment, and alignments are classified as high quality or low quality. As shown in FIG. 2, the heat map is clearly divided into two regions. The boundary visible in the heat map is at about 1,200 Kb and the precisely identified breakpoint is at 1,168,000 bp, so the two are essentially consistent. The whole sequence is thus divided into two clearly separated blocks, and no Hi-C Read spans them. According to the basic principle of Hi-C, the Hi-C signal of an accurately assembled genome is evenly distributed and does not form distinct blocks. The sequence is therefore judged to contain an assembly error with a breakpoint near the junction of the two blocks, and breaking the sequence at the breakpoint completes the error correction.
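The breakpoint search described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that "coverage" means the number of Hi-C read pairs whose two ends fall on opposite sides of a candidate position, and the function names, the scan window and the step size are all hypothetical.

```python
def spanning_count(read_pairs, position):
    """Count read pairs whose two ends lie on opposite sides of `position`.

    read_pairs: iterable of (pos_a, pos_b) coordinates on one sequence.
    Assumption: this spanning count stands in for "Hi-C Read coverage".
    """
    return sum(1 for a, b in read_pairs if min(a, b) < position <= max(a, b))


def find_breakpoint(read_pairs, lo, hi, step=1000):
    """Return the candidate position in [lo, hi] with the fewest spanning pairs.

    lo and hi bound the neighbourhood of a misplaced bin (hypothetical inputs).
    When several positions tie, the leftmost one is returned.
    """
    candidates = range(lo, hi + 1, step)
    return min(candidates, key=lambda p: spanning_count(read_pairs, p))
```

A sequence that is really two joined blocks has pairs inside each block but none across the junction, so the scan settles on the junction.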
S102: Perform genome assembly according to the genome and the alignment information.
Specifically, the device performs genome assembly according to the genome and the alignment information. The procedure may be as follows: partition the chromosome groups using the bin-level genome and the alignment file; restore the partitioned bin-level genome to the original error-corrected genome and obtain the chromosome group corresponding to each sequence of that genome; and order and orient, within each chromosome group, the error-corrected genome sequences assigned to it, which completes genome assembly at the chromosome level. FIG. 3 is a heat map of genome assembly at the chromosome level according to an embodiment of the present invention. As shown in FIG. 3, the chromosomes are clearly separated into groups, no chromosome fusion is present, signal continuity within each chromosome is good, and the assembly quality is good.
The method may further comprise: counting the distribution of the number of restriction enzyme sites per sequence in the error-corrected genome, and using the N80 and N95 values of the restriction site count as the bounds of the parameter sweep for chromosome partitioning; and taking the parameter that gives the smallest difference between the longest and the shortest chromosome as the target chromosome partitioning parameter. N80 is defined as follows: the sequences are sorted from longest to shortest and their lengths are accumulated in that order; the restriction site count of the sequence at which the accumulated length first exceeds 80 percent of the total sequence length is called N80. N95 is defined analogously and is not described again.
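The N80/N95 statistic defined above can be illustrated with a short sketch. The function name and the (length, site count) input layout are assumptions made for the example, and sorting by sequence length is one reading of the definition.

```python
def n_stat_sites(seqs, percent):
    """Restriction-site count of the sequence at which the cumulative length,
    taken from longest to shortest, first exceeds `percent` of the total.

    seqs: list of (length_bp, site_count) tuples, one per sequence.
    percent: integer such as 80 or 95.
    """
    if not seqs:
        raise ValueError("no sequences given")
    ordered = sorted(seqs, key=lambda s: s[0], reverse=True)
    total = sum(length for length, _ in ordered)
    running = 0
    for length, sites in ordered:
        running += length
        # Integer arithmetic avoids floating-point edge cases at the threshold.
        if running * 100 > percent * total:
            return sites
    return ordered[-1][1]  # only reached when percent is 100 or more


# Bounds of a partition-parameter sweep, from N95 up to N80.
example = [(500, 120), (310, 70), (150, 30), (40, 8)]
sweep = range(n_stat_sites(example, 95), n_stat_sites(example, 80) + 1)
```

Because longer sequences usually carry more restriction sites, the N95 value is normally the lower bound of the sweep and the N80 value the upper bound.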
Further, the method comprises: using the restriction site counts corresponding to N93 and N96 as the bounds of the parameter sweep for ordering and orientation within the chromosome group, and determining the target ordering parameter and the target orientation parameter by scoring the orientation of the sequences. N93 and N96 are defined in the same way as N80 above and are not described again. The orientation score can be obtained with existing techniques. Note that the orientation parameter with the higher orientation score can be taken as the target orientation parameter, and because the ordering parameter and the orientation parameter are proportionally related, the target ordering parameter can then be determined from the target orientation parameter.
S103: If the assembly result is determined not to meet the preset condition, perform regional clustering on the assembly result and reassemble the result of that processing.
Specifically, if the device determines that the assembly result does not meet the preset condition, it performs regional clustering on the assembly result and reassembles the processing result. Methods for determining whether an assembly result meets a preset condition are mature in the field and are not described in detail. The regional clustering and reassembly may proceed as follows: perform regional clustering on each chromosome; order and orient the sequences assigned to each region separately; and use sequencing data generated from a Hi-C library constructed with a four-base restriction endonuclease to order and orient the sequences within the chromosome group, which completes the reassembly. FIG. 4 is a heat map of the partitioned assembly of a single chromosome according to an embodiment of the present invention. As shown in FIG. 4, the single chromosome is clearly divided into 5 blocks, and the assembly inside each block is essentially free of problems, so an accurate chromosome sequence can be obtained by identifying the blocks and assembling them further. The chromosome can be partitioned into a specified number of groups (generally 5) with the LACHESIS software, and each group is then ordered and oriented to obtain the assembly result of each block region of the single chromosome. FIG. 5 is a heat map of the result of reassembling the blocks of a single chromosome according to an embodiment of the present invention. As shown in FIG. 5, after block assembly the signal of the single chromosome is concentrated near the diagonal and no errors are found, so the single-chromosome assembly is very accurate. The assembly result of the whole chromosome is obtained by running LACHESIS with each block as a unit.
The assembly result can be inspected with a heat map. If it is not satisfactory, an iterative approach can be used: errors in the assembly result are identified automatically, the sequences are broken at those errors, and assembly is repeated until no obvious assembly errors remain.
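The iterative scheme in the preceding paragraph amounts to a simple control loop. The sketch below shows only that loop; `assemble`, `find_errors` and `break_at` are hypothetical callbacks standing in for the assembly, error identification and breaking steps, and the round limit is an added safeguard that the text does not mention.

```python
def iterative_assembly(sequences, assemble, find_errors, break_at, max_rounds=10):
    """Assemble, detect errors, break, and repeat until no errors remain.

    assemble(sequences) -> assembly result
    find_errors(result) -> list of detected errors (empty when clean)
    break_at(sequences, errors) -> new list of sequences broken at the errors
    All three callbacks are placeholders for illustration only.
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    for _ in range(max_rounds):
        result = assemble(sequences)
        errors = find_errors(result)
        if not errors:
            break
        sequences = break_at(sequences, errors)
    return result
```

If the round limit is reached, the last assembly is returned even though it may still contain errors, so a caller would want to check the result again.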
FIG. 6 is a heat map of the single-chromosome diploid assembly of a highly heterozygous diploid genome according to an embodiment of the present invention. As shown in FIG. 6, the heat map is clearly divided into two blocks, and the presence of an anti-diagonal indicates interaction between them. This shows that the two blocks correspond to the two haplotype genomes, and the boundary between the two blocks is the dividing point between the haplotypes.
FIG. 7 is a block move diagram according to an embodiment of the present invention. As shown in FIG. 7, the region is clearly divided into 3 blocks, and the top block and the bottom block show an obvious interaction signal. The top block is therefore moved back by one block so that the signal is concentrated on the diagonal, which corresponds to the correct assembly. To draw the heat map, the chromosome is divided into bins of a specified length, and the signal value between any two bins (the number of links supported by Hi-C sequencing Reads) is calculated. This gives a matrix M, where Mi,j is the signal value between the i-th bin and the j-th bin on the chromosome, and the heat map is then drawn from the matrix with ggplot2. After drawing, the heat map is adjusted, either by manual moving or by chromosome-end telomere adjustment. Manual moving relies on the principle that interaction signals within a chromosome are concentrated near the diagonal. If a region of a chromosome has a strong signal far from the diagonal while being otherwise isolated, that is, it shows no obvious interaction signal with the other regions of the chromosome around it, the region needs to be moved next to the region it interacts with, using the heat map adjustment software of CGAP (the software that integrates this method).
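As a rough illustration of the matrix M and of a block move, the sketch below builds the bin-by-bin link counts from read-pair coordinates and reorders the bins. It is written in plain Python rather than R/ggplot2, plotting is left out, and the function names are hypothetical rather than part of CGAP.

```python
def contact_matrix(read_pairs, chrom_length, bin_size):
    """Build symmetric matrix M where M[i][j] counts read pairs linking bins i and j."""
    n = (chrom_length + bin_size - 1) // bin_size
    m = [[0] * n for _ in range(n)]
    for a, b in read_pairs:
        i, j = a // bin_size, b // bin_size
        m[i][j] += 1
        if i != j:
            m[j][i] += 1  # keep the matrix symmetric
    return m


def move_block(order, start, end, new_pos):
    """Move bins order[start:end] to index new_pos of the remaining bins."""
    block = order[start:end]
    rest = order[:start] + order[end:]
    return rest[:new_pos] + block + rest[new_pos:]


def reorder(matrix, order):
    """Return the matrix with rows and columns permuted into `order`."""
    return [[matrix[i][j] for j in order] for i in order]
```

After a move, `reorder` gives the matrix that would be redrawn as the adjusted heat map.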
FIG. 8 is a block inversion diagram according to an embodiment of the present invention. As shown in FIG. 8, if the lower edge of a block shows interaction signals with the region above the block, and the upper edge of the block shows interaction signals with the region below it, the block is a region to be inverted, and the inversion can be performed with the heat map adjustment software of CGAP.
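A block inversion can be sketched in the same spirit. The scoring function here is an assumption introduced only to show that the inversion pulls signal back toward the diagonal; it is not described in the source, and both function names are hypothetical.

```python
def invert_block(order, start, end):
    """Reverse the bins in order[start:end], leaving the rest in place."""
    return order[:start] + order[start:end][::-1] + order[end:]


def off_diagonal_weight(matrix, order):
    """Sum of link counts weighted by distance from the diagonal (illustrative score).

    matrix is indexed by original bin id; order lists the bin ids in displayed order.
    Lower values mean the signal sits closer to the diagonal.
    """
    return sum(
        matrix[a][b] * abs(i - j)
        for i, a in enumerate(order)
        for j, b in enumerate(order)
    )


# Toy matrix: only truly adjacent bins are linked, 10 links each.
n = 6
toy = [[10 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]
wrong = [0, 1, 4, 3, 2, 5]           # bins 2..4 assembled in reverse
fixed = invert_block(wrong, 2, 5)    # undo the inversion
```

In this toy case the score drops once the inverted block is flipped back, which is the numerical counterpart of the signal returning to the diagonal.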
FIG. 9 is a telomere adjustment diagram according to an embodiment of the present invention. As shown in FIG. 9, chromosome-end adjustment relies on the interaction produced by the clustering of chromosome ends in the plant cell nucleus, which appears in the interaction heat map as strong interaction between the ends of different chromosomes. Strong inter-chromosome interaction should therefore occur at the chromosome ends. If it appears in the middle of an assembled chromosome, inversion or move operations should be performed with the heat map adjustment software bundled with CGAP so that the strong inter-chromosome interaction signals lie at the chromosome ends, which gives an accurate chromosome assembly.
The following is a brief description of a specific application of the method:
example 1: application of CGAP software to Hi-C assembly of rosaceous high-heterozygosity plant genome
The main assembly steps include the following operations:
1. and (3) utilizing bwa and HiC-Pro to complete comparison and quality control of Hi-C data, only retaining effective Hi-C data, and finally obtaining main comparison files of all ValidPairs.
2. Cutting the genome with the Contig level into a segment of sequence with the fixed length of 50kb (namely, cutting into a sequence with the bin being 50000), then converting the alignment file into the corresponding coordinates of the cut genome file, assembling by using an error correction module of the CGAP, and outputting the coordinates which are inconsistent with the original result. And meanwhile, removing identified error points inconsistent with the graph by combining a Read distribution graph, finally remaining 108 break points, and then breaking the sequence by using a self-contained breaking script of the CGAP, namely breaking _ scaffold _ recording _ to _ position. Meanwhile, the CGAP self-contained tool 04.convert the alignment information file into the alignment information file corresponding to the corrected genome using the CGAP self-contained tool 04.convert _ sam _ for _ new _ corrected _ genome v2. pl.
3. Using a chromosome clustering grouping module of CGAP, the error-corrected genome is divided into 7 chromosomes by splitting into 100,000 bins.
4. Each clustered chromosome is ordered and oriented with the chromosome ordering and orientation module of CGAP, sweeping the minimum number of restriction sites from 20 to 28. After the software has run, the parameters with the highest ordering and orientation scores among the generated results are selected: the ordering parameter is 25 and the orientation parameter is 20.
5. Each single chromosome is further divided into 5 groups with the single-chromosome local hierarchical iterative assembly module of CGAP, and each group is then ordered and oriented with single sequences as units to obtain megascaffolds. The CGAP error identification module is then used to identify errors in the megascaffolds, which are broken at the errors and assembled further, so that the 5 error-corrected megascaffold sequences are integrated into one complete chromosome and a complete single-chromosome assembly is obtained. After this step the heat map shows a clear separation into two chromosomes, that is, both haplotype genomes of the diploid are separated and assembled.
6. Manual adjustment is performed with the heat map adjustment software described above, guided by the assembly heat map. The adjustment rule is that chromosome ends are where telomeres lie and strong inter-chromosome interaction exists in those regions, so the orientation and position of some regions can be adjusted accordingly.
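The breaking and coordinate conversion in step 2 can be pictured with a small sketch. It does not reproduce the CGAP scripts, whose interfaces are not given here; the piece naming scheme, the 0-based half-open coordinates and the function names are assumptions for illustration.

```python
import bisect


def break_sequence(name, length, breakpoints):
    """Split one sequence at the given breakpoints.

    Returns a list of (piece_name, start, end) with 0-based half-open coordinates.
    Piece names such as "ctg1_1" are a hypothetical convention.
    """
    cuts = [0] + sorted(b for b in breakpoints if 0 < b < length) + [length]
    return [(f"{name}_{k + 1}", cuts[k], cuts[k + 1]) for k in range(len(cuts) - 1)]


def remap(pos, pieces):
    """Convert a position on the original sequence (0 <= pos < length)
    to (piece_name, offset) on the broken sequences."""
    starts = [start for _, start, _ in pieces]
    k = bisect.bisect_right(starts, pos) - 1
    piece_name, start, _ = pieces[k]
    return piece_name, pos - start


# Breakpoint taken from the FIG. 2 example; the 2,000,000 bp length is made up.
pieces = break_sequence("ctg1", 2_000_000, [1_168_000])
```

Applying `remap` to both ends of every aligned read pair is one way to obtain an alignment file that refers to the error-corrected genome.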
According to the genome assembling method provided by the embodiment of the invention, the assembling results which do not reach the preset condition are subjected to regional clustering processing, and the processing results are reassembled, so that the genome assembling universality and accuracy can be improved.
On the basis of the above embodiment, the acquiring of the corrected gene information includes:
the genomes at the Scaffold and/or Contig level were cut into genomes at the bin level.
In particular, the device cuts the Scaffold and/or Contig level genome into the bin level genome. Reference may be made to the above embodiments, which are not described in detail.
And converting the Hi-C comparison information file at the Scaffold and/or Contig level into a bin level comparison file.
Specifically, the device converts the Hi-C comparison information file at the Scaffold and/or Contig level into a comparison file at the bin level. Reference may be made to the above embodiments, which are not described in detail.
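Cutting a sequence into bins and converting an alignment coordinate to the bin level can be sketched as below. The 50,000 bp default follows Example 1; the bin naming scheme and function names are hypothetical.

```python
def cut_into_bins(name, length, bin_size=50_000):
    """Return (bin_name, start, end) for consecutive fixed-size bins.

    Coordinates are 0-based half-open; the last bin may be shorter.
    """
    return [
        (f"{name}_bin{i}", start, min(start + bin_size, length))
        for i, start in enumerate(range(0, length, bin_size))
    ]


def to_bin_coordinate(name, pos, bin_size=50_000):
    """Map a position on the original sequence to (bin_name, offset within bin)."""
    return f"{name}_bin{pos // bin_size}", pos % bin_size
```

Converting both mates of each Hi-C read pair this way gives a bin-level alignment record for the initial assembly.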
Initial genome assembly was performed according to the bin level genome and the alignment file.
Specifically, the device performs initial genome assembly based on the bin-level genome and the alignment file. Reference may be made to the above embodiments, which are not described in detail.
Identify bins that cannot be assembled to their corresponding positions in the original Scaffold and/or Contig sequences, and break the sequence at the point near each such bin where Hi-C sequencing Read coverage is lowest to obtain the gene information.
Specifically, the device identifies bins that cannot be assembled to their corresponding positions in the original Scaffold and/or Contig sequences, and breaks the sequence at the point near each such bin where Hi-C sequencing Read coverage is lowest to obtain the gene information. Reference may be made to the above embodiments, which are not described in detail.
According to the genome assembly method provided by the embodiment of the invention, breaking the sequence at the point of lowest Hi-C sequencing Read coverage near the bin yields the gene information, so the error-corrected gene information can be obtained reasonably and effectively.
On the basis of the above embodiment, the genome assembling according to the genome and the alignment information includes:
the partitioning of the chromosome cohort is performed using the bin level genome and the alignment file.
Specifically, the apparatus performs the division of the chromosome group using the bin-level genome and the alignment file. Reference may be made to the above embodiments, which are not described in detail.
Restore the partitioned bin-level genome to the original error-corrected genome, and obtain the chromosome group corresponding to each sequence of the original error-corrected genome.
Specifically, the apparatus restores the partitioned bin-level genome to the original error-corrected genome and obtains the chromosome group corresponding to each sequence of the original error-corrected genome. Reference may be made to the above embodiments, which are not described in detail.
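One plausible way to carry bin-level grouping back to whole sequences is a majority vote over each sequence's bins. The source does not state the rule actually used, so the vote, the function name and the input layout are all assumptions.

```python
from collections import Counter, defaultdict


def assign_sequences(bin_groups, bin_to_sequence):
    """Assign each original sequence to the chromosome group most of its bins fell into.

    bin_groups: dict mapping bin name -> chromosome group label
    bin_to_sequence: dict mapping bin name -> name of the sequence it was cut from
    Majority voting is an illustrative assumption, not the documented rule.
    """
    votes = defaultdict(Counter)
    for bin_name, group in bin_groups.items():
        votes[bin_to_sequence[bin_name]][group] += 1
    return {seq: counts.most_common(1)[0][0] for seq, counts in votes.items()}
```

A sequence whose bins split evenly between groups would need an explicit tie-breaking rule, which this sketch leaves to the insertion order of the votes.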
Order and orient, within each chromosome group, the error-corrected genome sequences assigned to that group, so as to complete genome assembly at the chromosome level.
Specifically, the apparatus orders and orients, within each chromosome group, the error-corrected genome sequences assigned to that group, so as to complete genome assembly at the chromosome level. Reference may be made to the above embodiments, which are not described in detail.
According to the genome assembly method provided by the embodiment of the invention, the genome at the chromosome level can be accurately assembled by sequencing and orienting the corrected genome sequence divided into each chromosome group in the chromosome group.
On the basis of the above embodiment, after the step of performing ordering and orientation within the chromosome group on the error-corrected genome sequences divided into each chromosome group, the method further comprises:
Count the distribution of the number of restriction enzyme sites in the error-corrected genome sequences, and use the N80 and N95 values of the restriction site count as the bounds of the parameter sweep for chromosome partitioning.
Specifically, the device counts the distribution of the number of restriction enzyme sites in the error-corrected genome sequences and uses the N80 and N95 values of the restriction site count as the bounds of the parameter sweep for chromosome partitioning. Reference may be made to the above embodiments, which are not described in detail.
And taking the parameter corresponding to the minimum difference between the longest chromosome and the shortest chromosome as a target chromosome partitioning parameter.
Specifically, the device takes the parameter corresponding to the minimum difference between the longest chromosome and the shortest chromosome as the target chromosome partitioning parameter. Reference may be made to the above embodiments, which are not described in detail.
The genome assembly method provided by the embodiment of the invention can reasonably determine the division parameters of the target chromosome.
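The selection rule for the partitioning parameter can be written as a short sweep. In this sketch `cluster` is a hypothetical callback that runs chromosome partitioning with a given parameter and returns the resulting chromosome lengths; the actual clustering step is outside its scope.

```python
def choose_partition_parameter(candidates, cluster):
    """Pick the parameter giving the smallest gap between longest and shortest chromosome.

    candidates: iterable of parameter values, e.g. the N95..N80 site-count range
    cluster(param) -> list of chromosome lengths (placeholder callback)
    The first candidate wins when several give the same gap.
    """
    best_param, best_gap = None, None
    for param in candidates:
        lengths = cluster(param)
        gap = max(lengths) - min(lengths)
        if best_gap is None or gap < best_gap:
            best_param, best_gap = param, gap
    return best_param
```

The criterion favors partitions whose chromosomes are similar in size, which tends to penalize parameters that fuse several chromosomes into one group.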
On the basis of the above embodiment, the method further includes:
Use the restriction site counts corresponding to N93 and N96 as the bounds of the parameter sweep for ordering and orientation within the chromosome group.
Specifically, the device uses the restriction site counts corresponding to N93 and N96 as the bounds of the parameter sweep for ordering and orientation within the chromosome group. Reference may be made to the above embodiments, which are not described in detail.
And determining a target sequencing parameter and a target orientation parameter by scoring the orientation of the sequence.
Specifically, the device determines a target ordering parameter and a target orientation parameter by scoring the orientation of the sequence. Reference may be made to the above embodiments, which are not described in detail.
The genome assembly method provided by the embodiment of the invention can reasonably determine the target sequencing parameters and the target orientation parameters.
On the basis of the above embodiment, the performing regional clustering processing on the assembly result and reassembling the processing result includes:
and carrying out regional clustering on each chromosome.
Specifically, the device performs regional clustering on each chromosome. Reference may be made to the above embodiments, which are not described in detail.
The sequences divided into each region are sorted and oriented separately.
Specifically, the device sorts and orients the sequences divided into each region, respectively. Reference may be made to the above embodiments, which are not described in detail.
Sequencing data generated from a Hi-C library constructed with a four-base restriction endonuclease are used to order and orient the sequences within the chromosome group to complete the reassembly.
Specifically, the apparatus uses sequencing data generated from a Hi-C library constructed with a four-base restriction endonuclease to order and orient the sequences within the chromosome group to complete the reassembly. Reference may be made to the above embodiments, which are not described in detail.
According to the genome assembly method provided by the embodiment of the invention, sequencing data generated by a Hi-C library constructed by using four-base restriction endonuclease are used for sequencing and orienting sequences in a chromosome group, so that the genome assembly universality and accuracy can be further improved.
Fig. 10 is a schematic structural diagram of a genome assembly apparatus according to an embodiment of the present invention. As shown in Fig. 10, an embodiment of the present invention provides a genome assembly apparatus including an obtaining unit 1001, an assembly unit 1002, and a reassembly unit 1003, where:
the obtaining unit 1001 is configured to obtain corrected gene information, where the gene information includes a genome and alignment information; the assembly unit 1002 is configured to perform genome assembly according to the genome and the alignment information; the reassembly unit 1003 is configured to, if it is determined that the assembly result does not meet the preset condition, perform regional clustering on the assembly result and reassemble the result of that processing.
Specifically, the obtaining unit 1001 is configured to obtain corrected gene information, where the gene information includes a genome and alignment information; the assembly unit 1002 is configured to perform genome assembly according to the genome and the alignment information; the reassembly unit 1003 is configured to, if it is determined that the assembly result does not meet the preset condition, perform regional clustering on the assembly result and reassemble the result of that processing.
According to the genome assembly apparatus provided by this embodiment of the invention, an assembly result that does not meet the preset condition is subjected to regional clustering and the result of that processing is reassembled, which can improve the generality and accuracy of genome assembly.
The genome assembly apparatus provided in the embodiments of the present invention may be specifically configured to execute the processing procedures of the above method embodiments. Its functions are not described again here; refer to the detailed description of the above method embodiments.
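The cooperation of the obtaining unit, the assembly unit, and the reassembly unit can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the embodiment: the class name `GenomeAssemblyDevice`, its field names, and the toy stand-in functions are hypothetical, and the preset condition is represented by an arbitrary callable because this section does not define it.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class GenomeAssemblyDevice:
    """Hypothetical wiring of units 1001-1003; every callable is an assumption."""
    obtain: Callable[[], Tuple[Any, Any]]    # unit 1001: returns (genome, alignment info)
    assemble: Callable[[Any, Any], Any]      # unit 1002: initial chromosome-level assembly
    meets_condition: Callable[[Any], bool]   # preset condition (not defined in this section)
    regional_cluster: Callable[[Any], Any]   # unit 1003, first step
    reassemble: Callable[[Any], Any]         # unit 1003, second step

    def run(self) -> Any:
        genome, alignment = self.obtain()
        result = self.assemble(genome, alignment)
        # Reassembly is triggered only when the preset condition is not met.
        if not self.meets_condition(result):
            result = self.reassemble(self.regional_cluster(result))
        return result


# Toy stand-ins, used only to show the control flow.
device = GenomeAssemblyDevice(
    obtain=lambda: (["ctgA", "ctgB"], {"pairs": 10}),
    assemble=lambda genome, alignment: {"order": list(genome), "ok": False},
    meets_condition=lambda result: result["ok"],
    regional_cluster=lambda result: {**result, "regions": [result["order"]]},
    reassemble=lambda result: {**result, "ok": True},
)
result = device.run()
```

In this sketch the reassembly branch is skipped entirely when `meets_condition` returns true, which mirrors the conditional wording used for the reassembly unit 1003.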
Fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in Fig. 11, the electronic device includes: a processor 1101, a memory 1102, and a bus 1103;
the processor 1101 and the memory 1102 communicate with each other through the bus 1103;
the processor 1101 is configured to call the program instructions in the memory 1102 to perform the methods provided by the above method embodiments, for example, including: acquiring corrected gene information, wherein the gene information comprises a genome and alignment information; performing genome assembly according to the genome and the alignment information; and, if it is determined that the assembly result does not meet the preset condition, performing regional clustering on the assembly result and reassembling the result of that processing.
The present embodiment discloses a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the methods provided by the above method embodiments, for example, including: acquiring corrected gene information, wherein the gene information comprises a genome and alignment information; performing genome assembly according to the genome and the alignment information; and, if it is determined that the assembly result does not meet the preset condition, performing regional clustering on the assembly result and reassembling the result of that processing.
The present embodiment provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the methods provided by the above method embodiments, for example, including: acquiring corrected gene information, wherein the gene information comprises a genome and alignment information; performing genome assembly according to the genome and the alignment information; and, if it is determined that the assembly result does not meet the preset condition, performing regional clustering on the assembly result and reassembling the result of that processing.
Those of ordinary skill in the art will understand that all or part of the steps of the method embodiments may be implemented by hardware controlled by program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The storage medium includes any medium that can store program code, such as ROM, RAM, magnetic disks, or optical disks.
The above-described embodiments of the electronic device and the like are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may also be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that the above embodiments only illustrate the technical solutions of the embodiments of the present invention and do not limit them. Although the embodiments of the present invention have been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A method of assembling a genome, comprising:
acquiring corrected gene information, wherein the gene information comprises a genome and alignment information;
performing genome assembly according to the genome and the alignment information;
if it is determined that the assembly result does not meet the preset condition, performing regional clustering on the assembly result and reassembling the result of that processing;
wherein acquiring the corrected gene information comprises:
cutting the Scaffold-level and/or Contig-level genome into a bin-level genome;
converting the Scaffold-level and/or Contig-level Hi-C alignment file into a bin-level alignment file;
performing initial genome assembly according to the bin-level genome and the bin-level alignment file;
identifying bins that cannot be assembled to their corresponding positions in the original Scaffold and/or Contig sequences; and breaking the sequence at the point of lowest Hi-C sequencing read coverage near each such bin, to obtain the gene information.
2. The method of claim 1, wherein performing genome assembly according to the genome and the alignment information comprises:
partitioning chromosome groups using the bin-level genome and the bin-level alignment file;
restoring the partitioned bin-level genome to the original error-corrected genome, and obtaining the chromosome group to which each sequence of the original error-corrected genome belongs;
and ordering and orienting, within each chromosome group, the error-corrected genome sequences assigned to that group, so as to complete genome assembly at the chromosome level.
3. The method of claim 2, wherein after the step of ordering and orienting, within each chromosome group, the error-corrected genome sequences assigned to that group, the method further comprises:
counting the distribution of the number of restriction sites in the error-corrected genome sequences, and taking the restriction-site counts corresponding to N80 and N95 as the iteration range of the chromosome partitioning parameter;
and taking the parameter that yields the minimum difference between the longest chromosome and the shortest chromosome as the target chromosome partitioning parameter.
4. The method of claim 3, further comprising:
calculating the restriction-site counts corresponding to N93 and N96 as the iteration range of the ordering and orientation parameters within the chromosome group;
and determining a target ordering parameter and a target orientation parameter by scoring the orientation of the sequences.
5. The method according to any one of claims 1 to 4, wherein performing regional clustering on the assembly result and reassembling the result of that processing comprises:
performing regional clustering on each chromosome;
ordering and orienting the sequences assigned to each region separately;
and using sequencing data generated from a Hi-C library constructed with a four-base restriction endonuclease to order and orient the sequences within each chromosome group, thereby completing the reassembly.
6. An apparatus for assembling a genome, comprising:
an obtaining unit, configured to obtain corrected gene information, wherein the gene information comprises a genome and alignment information;
an assembly unit, configured to perform genome assembly according to the genome and the alignment information;
a reassembly unit, configured to, if it is determined that the assembly result does not meet the preset condition, perform regional clustering on the assembly result and reassemble the result of that processing;
wherein obtaining the corrected gene information comprises:
cutting the Scaffold-level and/or Contig-level genome into a bin-level genome;
converting the Scaffold-level and/or Contig-level Hi-C alignment file into a bin-level alignment file;
performing initial genome assembly according to the bin-level genome and the bin-level alignment file;
identifying bins that cannot be assembled to their corresponding positions in the original Scaffold and/or Contig sequences; and breaking the sequence at the point of lowest Hi-C sequencing read coverage near each such bin, to obtain the gene information.
7. An electronic device, comprising: a processor, a memory, and a bus, wherein,
the processor and the memory communicate with each other through the bus;
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1 to 5.
8. A non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method of any one of claims 1 to 5.
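For illustration only, the bin-level conversion recited in claim 1 and the parameter selection recited in claim 3 can be sketched as below. Everything in this sketch is an assumption rather than the claimed implementation: the bin length `BIN_SIZE`, the `name_binN` naming scheme, the function names, and the reading of "N80" and "N95" as Nxx-style statistics computed over per-sequence restriction-site counts (by analogy with the usual N50 definition over sequence lengths).

```python
from typing import Dict, List, Tuple

BIN_SIZE = 50_000  # assumed bin length in bp; no value is given in this section


def cut_into_bins(lengths: Dict[str, int], bin_size: int = BIN_SIZE) -> List[Tuple[str, int, int]]:
    """Split each Scaffold/Contig into consecutive (name, start, end) bins, 0-based half-open."""
    bins = []
    for name, length in lengths.items():
        for start in range(0, length, bin_size):
            bins.append((name, start, min(start + bin_size, length)))
    return bins


def to_bin_coordinate(name: str, pos: int, bin_size: int = BIN_SIZE) -> Tuple[str, int]:
    """Map a 0-based alignment position on a Scaffold/Contig to (bin id, offset inside the bin)."""
    index = pos // bin_size
    return f"{name}_bin{index}", pos - index * bin_size


def nxx_site_count(site_counts: List[int], percent: int) -> int:
    """Assumed Nxx statistic: walking from the largest site count down, return the
    count at which the running total first reaches `percent` % of all sites."""
    total = sum(site_counts)
    running = 0
    for count in sorted(site_counts, reverse=True):
        running += count
        if running * 100 >= percent * total:  # integer arithmetic avoids float rounding
            return count
    return 0


def pick_partition_parameter(chrom_lengths_by_param: Dict[int, List[int]]) -> int:
    """Choose the parameter whose partition has the smallest longest-minus-shortest gap."""
    return min(
        chrom_lengths_by_param,
        key=lambda p: max(chrom_lengths_by_param[p]) - min(chrom_lengths_by_param[p]),
    )


bins = cut_into_bins({"scaffold1": 120_000})
hit = to_bin_coordinate("scaffold1", 73_500)
low, high = sorted((nxx_site_count([100, 60, 30, 10], 95), nxx_site_count([100, 60, 30, 10], 80)))
best = pick_partition_parameter({30: [9, 5, 1], 45: [6, 5, 4], 60: [8, 5, 2]})
```

Here `low` and `high` bound the hypothetical iteration range of the partitioning parameter, and `best` is the candidate whose resulting chromosomes are most even in length. In practice the per-parameter chromosome lengths would come from rerunning the partitioning step for each candidate value, which this sketch does not attempt.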
CN201811068666.5A 2018-09-13 2018-09-13 Genome assembly method and device Active CN109326323B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811068666.5A CN109326323B (en) 2018-09-13 2018-09-13 Genome assembly method and device

Publications (2)

Publication Number Publication Date
CN109326323A CN109326323A (en) 2019-02-12
CN109326323B true CN109326323B (en) 2022-03-18

Family

ID=65266004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811068666.5A Active CN109326323B (en) 2018-09-13 2018-09-13 Genome assembly method and device

Country Status (1)

Country Link
CN (1) CN109326323B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110020726B (en) * 2019-03-04 2023-08-18 武汉希望组生物科技有限公司 Method and system for ordering assembly sequence
CN111584004B (en) * 2020-05-12 2023-06-27 西藏自治区农牧科学院水产科学研究所 Tibet characteristic fish genome assembly method based on three-dimensional histology data
CN111564180B (en) * 2020-05-12 2024-08-06 西藏自治区农牧科学院水产科学研究所 Method for comparative analysis of ancient chromosome evolution of Paeonia lactiflora
CN112289382B (en) * 2020-10-28 2023-02-24 天津诺禾致源生物信息科技有限公司 Splitting method and device for polyploid genome homologous chromosome and application thereof
CN113808668B (en) * 2021-11-18 2022-02-18 北京诺禾致源科技股份有限公司 Method and device for improving genome assembly integrity and application thereof
CN114464260B (en) * 2021-12-29 2023-09-26 上海诺禾致源医学检验实验室有限公司 Method and device for assembling chromosome horizontal genome
CN114566212B (en) * 2022-04-29 2022-09-16 天津诺禾致源生物信息科技有限公司 Method and device for carrying Hi-C genome larger than 10G
CN115810395B (en) * 2022-12-05 2023-09-26 武汉贝纳科技有限公司 T2T assembly method based on high-throughput sequencing animal and plant genome
CN115579061B (en) * 2022-12-07 2023-04-07 北京诺禾致源科技股份有限公司 Method and device for analyzing genome hic

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104017883B (en) * 2014-06-18 2015-11-18 深圳华大基因科技服务有限公司 The method and system of assembling genome sequence
CN104239750B (en) * 2014-08-25 2017-07-28 北京百迈客生物科技有限公司 Genome based on high-flux sequence data from the beginning assemble method
CN104298892B (en) * 2014-09-18 2017-05-10 天津诺禾致源生物信息科技有限公司 Detection device and method for gene fusion
CN108460245B (en) * 2017-02-21 2020-11-06 深圳华大基因科技服务有限公司 Method and apparatus for optimizing second generation assembly results using third generation sequences
CN107832584B (en) * 2017-09-18 2021-06-22 中国科学院深圳先进技术研究院 Gene analysis method, device, equipment and storage medium of metagenome
CN107563151B (en) * 2017-09-18 2020-09-22 杭州和壹基因科技有限公司 Error correction method for genome sequence assembled by PacBio sequencing data

Also Published As

Publication number Publication date
CN109326323A (en) 2019-02-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant