CN111243666B - Nextflow-based automatic analysis method and system for circular ribonucleic acid - Google Patents


Publication number
CN111243666B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010024079.7A
Other languages
Chinese (zh)
Other versions
CN111243666A (en)
Inventor
蔡宏民
魏焯辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Application filed by South China University of Technology SCUT
Priority to CN202010024079.7A
Publication of CN111243666A
Application granted
Publication of CN111243666B

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30: Detection of binding sites or motifs
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The embodiments of the invention provide a Nextflow-based automatic analysis method and system for circular ribonucleic acid (circRNA). Multiple circRNA analysis tools are integrated through the Nextflow framework; the results produced by these tools are compared, de-duplicated and screened, and the results of the different tools are combined into a final result, so that a more comprehensive and accurate circRNA prediction and analysis report can be obtained.

Description

Nextflow-based automatic analysis method and system for circular ribonucleic acid
Technical Field
The invention relates to the technical field of biological analysis and big data mining, and in particular to a Nextflow-based automatic analysis method and system for circular ribonucleic acid.
Background
circRNA (circular ribonucleic acid) is a special class of small circular non-coding RNA and a recent research hotspot in the RNA field. Unlike conventional linear RNA, circRNA has a closed, circular molecular structure, so it is not affected by RNA exonucleases, is not easily degraded, and is expressed more stably.
Research in recent years shows that circRNA molecules are rich in binding sites for microRNA (miRNA), so circRNA can absorb miRNA (acting as a miRNA sponge). This relieves the inhibition that the miRNA exerts on its target genes in the cell and raises the expression level of those target genes. This mechanism of action is known as the competitive endogenous RNA mechanism. By interacting with disease-related miRNA, naturally occurring circRNA molecules influence gene expression and play an important regulatory role in the occurrence and development of diseases, the growth and development of organisms, resistance to the external environment, and similar processes.
To search for circRNA more effectively and comprehensively, several circRNA prediction tools based on RNA sequencing reads have been developed in recent years, including STAR-based CIRCexplorer2, BWA-based CIRI, MapSplice, Segemehl, and Bowtie2-based find_circ.
However, each of the tools listed above has shortcomings in detecting circRNA.
MapSplice and STAR-based CIRCexplorer2 both have low false positive rates and output reliable circRNA lists, but because they rely on annotation files of known genes, they cannot discover circRNA de novo.
Segemehl finds the most circRNAs, but it runs for a long time and consumes a large amount of memory, which places demands on the hardware configuration; its false positive rate is also high, so the resulting circRNA list requires further screening.
find_circ and CIRI have shorter running times and return predictions faster than the other tools; however, they report fewer circRNAs, and because each is limited by its alignment algorithm and the reference genome, both tools miss some circRNAs.
Therefore, how to obtain a more comprehensive and accurate circRNA prediction and analysis report is a technical problem in urgent need of a solution.
Disclosure of Invention
The invention aims to provide a Nextflow-based automatic analysis method and system for circular ribonucleic acid, which integrate multiple circRNA analysis tools through the Nextflow framework, compare the results produced by these tools, and combine the results of the different tools into a final result, thereby obtaining a more comprehensive and accurate circRNA prediction and analysis report.
In a first aspect, the embodiments of the present invention provide a Nextflow-based automatic analysis method for circular ribonucleic acid, comprising the following steps:
S1, performing quality control on the input raw gene data and the reference genome sequence, removing abnormal fragments whose quality score is lower than a first set value and whose GC content is higher than a second set value, and generating a quality control report; wherein fastp and MultiQC software are used to implement step S1;
S2, aligning the sequence fragments of the input sample to the reference genome sequence to determine the specific position of each sequence on the genome; wherein STAR, BWA, Bowtie2 and Bowtie software are each used to implement step S2 independently;
S3, after the sequence fragments of the input sample have been aligned to the reference genome sequence, determining the sequence type and count of the circular ribonucleic acids, and annotating the name of each circular ribonucleic acid and the chromosome position where it is located by means of an annotation file; wherein STAR-based CIRCexplorer2, BWA-based CIRI, Bowtie2-based MapSplice, Segemehl, and Bowtie2-based find_circ software are each used to implement step S3 independently;
S4, merging and de-duplicating the sequence types and counts of the circular ribonucleic acids obtained respectively by STAR-based CIRCexplorer2, BWA-based CIRI, Bowtie2-based MapSplice, Segemehl, and Bowtie2-based find_circ software;
S5, analyzing and interpreting the sequence types and counts of the merged and de-duplicated circular ribonucleic acids to generate a chart report for the raw data;
wherein steps S1-S5 are all carried out in Nextflow.
Further, before step S1, the Nextflow-based automated analysis method for circular ribonucleic acid further includes:
S0, building an alignment index file for the input genome sequence; wherein Bowtie, Bowtie2 and STAR software are each used to implement step S0 independently.
Further, merging and de-duplicating the sequence types and counts of the circular ribonucleic acids obtained respectively by STAR-based CIRCexplorer2, BWA-based CIRI, Bowtie2-based MapSplice, Segemehl, and Bowtie2-based find_circ software specifically comprises:
merging the records of circular ribonucleic acids of the same type, and deleting the pre-merge records of that type; wherein the final count of a type of circular ribonucleic acid is the count of the merged record, which is the average of the counts of all circular ribonucleic acids of that type.
Further, if the detected circular ribonucleic acids are on the same chromosome, and the difference between the base start position of the comparison result of the circular ribonucleic acids in rows N-2 and N-1 of the sorted order and the base start position of the circular ribonucleic acid in row N is less than or equal to 5, and the difference between the base end position of that comparison result and the base end position of the circular ribonucleic acid in row N is less than or equal to 5, then the comparison result of rows N-2 and N-1 is of the same type as the circular ribonucleic acid in row N.
Further, the rows are sorted by the position and count of each type of circular ribonucleic acid, and the sort keys are, in order: chromosome-base start position-base end position-count of circular ribonucleic acids of this type.
Further, the chart report includes position analysis information, length analysis information, count analysis information and type analysis information for the circular ribonucleic acids.
Further, the Nextflow-based automated analysis method for circular ribonucleic acid further comprises the following steps:
running an instruction to automatically configure the software environment according to preset configuration steps;
automatically detecting the hardware configuration of the current server, and automatically adjusting the software parameters according to that hardware configuration.
In a second aspect, the embodiments of the present invention further provide a Nextflow-based automated analysis system for circular ribonucleic acid, including:
a quality control module, used for performing quality control on the input raw gene data and the reference genome sequence, removing abnormal fragments whose quality score is lower than a first set value and whose GC content is higher than a second set value, and generating a quality control report; wherein fastp and MultiQC software are used to implement the function of the quality control module;
an alignment module, used for aligning the sequence fragments of the input sample to the reference genome sequence to determine the specific position of each sequence on the genome; wherein STAR, BWA, Bowtie2 and Bowtie software are each used to implement the function of the alignment module independently;
a quantification module, used for determining the sequence type and count of the circular ribonucleic acids after the sequence fragments of the input sample have been aligned to the reference genome sequence, and annotating the name of each circular ribonucleic acid and the chromosome position where it is located by means of an annotation file; wherein STAR-based CIRCexplorer2, BWA-based CIRI, Bowtie2-based MapSplice, Segemehl, and Bowtie2-based find_circ software are each used to implement the function of the quantification module independently;
a merging and de-duplication module, used for merging and de-duplicating the sequence types and counts of the circular ribonucleic acids obtained respectively by STAR-based CIRCexplorer2, BWA-based CIRI, Bowtie2-based MapSplice, Segemehl, and Bowtie2-based find_circ software;
a report generation module, used for analyzing and interpreting the sequence types and counts of the merged and de-duplicated circular ribonucleic acids to generate a chart report for the raw data;
wherein the quality control module, the alignment module, the quantification module, the merging and de-duplication module and the report generation module are all implemented in Nextflow.
Further, merging and de-duplicating the sequence types and counts of the circular ribonucleic acids obtained respectively by STAR-based CIRCexplorer2, BWA-based CIRI, Bowtie2-based MapSplice, Segemehl, and Bowtie2-based find_circ software specifically comprises:
merging the records of circular ribonucleic acids of the same type, and deleting the pre-merge records of that type; wherein the final count of a type of circular ribonucleic acid is the count of the merged record, which is the average of the counts of all circular ribonucleic acids of that type.
Further, if the detected circular ribonucleic acids are on the same chromosome, and the difference between the base start position of the comparison result of the circular ribonucleic acids in rows N-2 and N-1 of the sorted order and the base start position of the circular ribonucleic acid in row N is less than or equal to 5, and the difference between the base end position of that comparison result and the base end position of the circular ribonucleic acid in row N is less than or equal to 5, then the comparison result of rows N-2 and N-1 is of the same type as the circular ribonucleic acid in row N; wherein the rows are sorted by the position and count of each type of circular ribonucleic acid, and the sort keys are, in order: chromosome-base start position-base end position-count of circular ribonucleic acids of this type.
According to the embodiments of the invention, multiple circRNA analysis tools are integrated through the Nextflow framework, the results produced by these tools are compared, de-duplicated and screened, and the results of the different tools are combined into a final result, so that a more comprehensive and accurate circRNA prediction and analysis report can be obtained.
Drawings
FIG. 1 is a schematic diagram of the tools used in the Nextflow-based automated analysis method for circular ribonucleic acid provided in Example 1;
FIG. 2 is a schematic structural diagram of the Nextflow-based automated analysis system for circular ribonucleic acid provided in Example 2.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort fall within the protection scope of the present invention.
The terms "comprises" and "comprising" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The term "and/or" refers to and includes any and all possible combinations of one or more of the associated listed items.
Nextflow is a reactive workflow framework and domain-specific programming language that simplifies the writing of data-intensive pipelines. Its design rests on the idea that the Linux platform is the lingua franca of data science: Linux provides many simple but powerful command-line and scripting tools that, when chained together, simplify complex data manipulation. Nextflow extends this approach, adding the ability to define complex program interactions and a high-level parallel computing environment based on the dataflow programming model.
circRNA (circular ribonucleic acid) is a special class of small circular non-coding RNA and a recent research hotspot in the RNA field. Unlike conventional linear RNA, circRNA has a closed, circular molecular structure, so it is not affected by RNA exonucleases, is not easily degraded, and is expressed more stably.
Research in recent years shows that circRNA molecules are rich in binding sites for microRNA (miRNA), so circRNA can absorb miRNA (acting as a miRNA sponge). This relieves the inhibition that the miRNA exerts on its target genes in the cell and raises the expression level of those target genes. This mechanism of action is known as the competitive endogenous RNA mechanism. By interacting with disease-related miRNA, naturally occurring circRNA molecules influence gene expression and play an important regulatory role in the occurrence and development of diseases, the growth and development of organisms, resistance to the external environment, and similar processes.
Example 1:
Referring to FIG. 1, which is a schematic diagram of the tools used in steps S1-S5.
The embodiment of the invention provides a Nextflow-based automatic analysis method for circular ribonucleic acid, which comprises steps S1-S5, wherein steps S1-S5 are all carried out in Nextflow.
S1, performing quality control on the input raw gene data and the reference genome sequence, removing abnormal fragments whose quality score is lower than a first set value and whose GC content is higher than a second set value, and generating a quality control report; wherein fastp and MultiQC software are used to implement step S1.
Here, GC content is the proportion of guanine and cytosine bases in a sequence; the first set value is 0.4 and the second set value is 0.6. MultiQC depends on the analysis results of fastp and compiles summary statistics from fastp's quality control results.
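The thresholds in step S1 can be illustrated with a minimal Python sketch. It is not the fastp implementation. The function names, the assumption that the quality score is already normalized to the range 0 to 1, and the choice to discard a fragment when either threshold is violated are all illustrative assumptions that the text does not specify.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C (case-insensitive)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def is_abnormal(seq: str, quality: float,
                min_quality: float = 0.4, max_gc: float = 0.6) -> bool:
    """Flag a fragment for removal using the two set values from the text.

    Assumptions (not stated in the source): `quality` is normalized to
    [0, 1], and violating either threshold is enough to discard the fragment.
    """
    return quality < min_quality or gc_content(seq) > max_gc


# Hypothetical (sequence, quality) pairs used only for illustration.
reads = [("ATGCATGCAT", 0.9), ("GGGCCCGGGC", 0.9), ("ATATATATGC", 0.2)]
kept = [seq for seq, q in reads if not is_abnormal(seq, q)]
print(kept)
```

In this sketch the second fragment is dropped for its GC content and the third for its low quality score, so only the first fragment is kept.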
S2, aligning the sequence fragments of the input sample to the reference genome sequence to determine the specific position of each sequence on the genome; wherein STAR, BWA, Bowtie2 and Bowtie software are each used to implement step S2 independently.
S3, after the sequence fragments of the input sample have been aligned to the reference genome sequence, determining the sequence type and count of the circular ribonucleic acids, and annotating the name of each circular ribonucleic acid and the chromosome position where it is located by means of an annotation file; wherein STAR-based CIRCexplorer2, BWA-based CIRI, Bowtie2-based MapSplice, Segemehl, and Bowtie2-based find_circ software are each used to implement step S3 independently.
Among these, STAR-based CIRCexplorer2 detects circRNA using the idea of fusion gene detection. The main process is as follows: first, the reads that STAR cannot align are extracted and aligned to the genome using TopHat-Fusion. Reads that TopHat-Fusion aligns to non-colinear candidate positions on the genome are potential head-to-tail junction reads. Then, with the help of gene annotation, more precise donor and acceptor positions are determined for these reads. Finally, the circRNA is annotated.
BWA-based CIRI is built mainly on aligning sequences to large genomes. Specifically, an index is first built for the large reference genome using the BWT (Burrows-Wheeler transform) compression algorithm, and the sequences are then aligned to the genome. CIRI is fast, accurate and memory-efficient.
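To make the BWT step concrete, the following Python sketch computes the Burrows-Wheeler transform of a short string by sorting its rotations. It is a teaching sketch only: BWA builds its index with far more memory-efficient algorithms, and the function name and the `$` sentinel are illustrative choices.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations.

    Quadratic in memory, so suitable only for short illustrative strings.
    The sentinel must sort before every character of the text.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    # Build every rotation of s, sort them, and read off the last column.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("ACAACG"))  # prints GC$AAAC
```

The transform groups identical characters together, which is what makes the indexed genome compressible and searchable.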
Segemehl is software that maps short sequence reads to a reference genome. It implements a matching strategy based on an enhanced suffix array (ESA). For each suffix of a read, Segemehl aims to find the best-scoring seed. A seed may contain insertions, deletions and mismatches (differences). The number of differences allowed in a seed is controlled by the user [parameter -D, --differences], which is critical to how the program runs.
MapSplice is a highly specific and sensitive transcriptome sequencing alignment algorithm published by Kai Wang et al. in Nucleic Acids Research in 2010. MapSplice does not depend on splice site features or intron length, so it is better at detecting novel canonical and non-canonical splice sites. MapSplice strikes a good balance between alignment quality and the diversity of the sequences. The algorithm has two phases: tag alignment and splice inference.
find_circ is based on Bowtie2 alignment. The key step in predicting circular RNA from high-throughput sequencing data is to find junction reads that cannot be aligned contiguously to the genome or transcriptome. To do this, the RNA reads are first aligned to the genome and the unaligned reads are collected. find_circ then takes 20 bases from each end of these unaligned reads and aligns them to the genome again (ensuring that each aligns uniquely). Next, GU/AG splice sites are identified from the short-sequence alignments to infer potential circular RNA sequences.
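The step of taking 20 bases from each end of an unaligned read can be sketched as follows. This Python fragment is an illustration only, not find_circ's own code. The function name `split_anchors` and the rule of skipping reads too short to yield two non-overlapping anchors are assumptions.

```python
from typing import Optional, Tuple


def split_anchors(read: str, anchor_len: int = 20) -> Optional[Tuple[str, str]]:
    """Return the first and last `anchor_len` bases of an unaligned read.

    Returns None when the read is too short to give two non-overlapping
    anchors (an assumed rule; the text does not say how short reads are handled).
    """
    if len(read) < 2 * anchor_len:
        return None
    return read[:anchor_len], read[-anchor_len:]


# Hypothetical 50-base read: 20 A's, 10 C's, 20 G's.
read = "A" * 20 + "C" * 10 + "G" * 20
head, tail = split_anchors(read)
print(head, tail)
```

Each anchor would then be aligned to the genome on its own. A tail anchor that maps upstream of the head anchor on the same chromosome is the signature of a back-splice junction.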
S4, merging and de-duplicating the sequence types and counts of the circular ribonucleic acids obtained respectively by STAR-based CIRCexplorer2, BWA-based CIRI, Bowtie2-based MapSplice, Segemehl, and Bowtie2-based find_circ software.
Step S4 can be implemented with a shell script and a Python script.
In the embodiment of the present invention, merging and de-duplicating the sequence types and counts of the circular ribonucleic acids obtained respectively by STAR-based CIRCexplorer2, BWA-based CIRI, Bowtie2-based MapSplice, Segemehl, and Bowtie2-based find_circ software specifically comprises:
merging the records of circular ribonucleic acids of the same type, and deleting the pre-merge records of that type; wherein the final count of a type of circular ribonucleic acid is the count of the merged record, which is the average of the counts of all circular ribonucleic acids of that type.
Here, if the detected circular ribonucleic acids are on the same chromosome, and the difference between the base start position of the comparison result of the circular ribonucleic acids in rows N-2 and N-1 of the sorted order and the base start position of the circular ribonucleic acid in row N is less than or equal to 5, and the difference between the base end position of that comparison result and the base end position of the circular ribonucleic acid in row N is less than or equal to 5, then the comparison result of rows N-2 and N-1 is of the same type as the circular ribonucleic acid in row N. The rows are sorted by the position and count of each type of circular ribonucleic acid, and the sort keys are, in order: chromosome-base start position-base end position-count of circular ribonucleic acids of this type. Chromosome, base start position, base end position and count are each sorted in ascending order: chromosomes are sorted first; where the chromosomes are the same, the base start positions are sorted next, and so on.
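The sorting rule can be sketched in Python as a multi-key sort. The record layout `(chromosome, start, end, count)` and the treatment of non-numeric chromosome names (placed after numeric ones, in alphabetical order) are assumptions made for illustration. The text only states that each key is sorted in ascending order.

```python
def chrom_key(chrom: str):
    """Numeric chromosomes sort first by number; named ones (X, Y, ...) follow."""
    return (0, int(chrom), "") if chrom.isdigit() else (1, 0, chrom)


def sort_records(records):
    """Sort (chromosome, start, end, count) records by each key in ascending order."""
    return sorted(records, key=lambda r: (chrom_key(r[0]), r[1], r[2], r[3]))


# Hypothetical records used only for illustration.
records = [("X", 5, 50, 1), ("10", 8, 98, 2), ("2", 10, 100, 1), ("2", 4, 94, 1)]
for record in sort_records(records):
    print(record)
```

Sorting chromosome names numerically rather than as plain strings keeps chromosome 2 ahead of chromosome 10, which a simple string sort would reverse.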
The records of circular ribonucleic acids of the same type are merged according to the following method:
if the circular ribonucleic acids in row N-1 and row N belong to the same type, the position of the row with the larger count is taken as the position of the current merged result, where the position comprises the chromosome, the base start position and the base end position;
if the counts are the same, the position of the row with the longer circRNA is taken as the position of the current merged result;
if the circRNA lengths are also the same, the position of the row with the smaller base start position is taken as the position of the current merged result.
It should be understood that in this embodiment the five tools above (STAR-based CIRCexplorer2, BWA-based CIRI, Bowtie2-based MapSplice, Segemehl, and Bowtie2-based find_circ) produce five different results for the same sample, and the final count of a type of circular ribonucleic acid is determined by averaging the counts of all circular ribonucleic acids of that type across these five results.
For ease of understanding, a brief example with three tools follows.
STAR-based CIRCexplorer2, referred to as tool A, detects a certain type of circular ribonucleic acid at bases 10 to 100 of chromosome 1, with a count of 1. This is abbreviated as: A:1-10-100-1
BWA-based CIRI, referred to as tool B, detects a certain type of circular ribonucleic acid at bases 8 to 98 of chromosome 1, with a count of 2. This is abbreviated as: B:1-8-98-2
Bowtie2-based MapSplice, referred to as tool C, detects a certain type of circular ribonucleic acid at bases 4 to 94 of chromosome 1, with a count of 1. This is abbreviated as: C:1-4-94-1
Then, according to the sorting rule above, A, B and C are ordered as:
C:1-4-94-1
B:1-8-98-2
A:1-10-100-1
Comparing in order: since |4-8| = 4 ≤ 5 and |94-98| = 4 ≤ 5, C and B belong to the same type and must be merged. C detected 1 circular ribonucleic acid of this type and B detected 2, so the position of row B is taken as the position of the current merged result, namely B:1-8-98, and the count is the average, (1+2)/2 = 1.5; that is, the comparison result (also called the current merged result) is B:1-8-98-1.5.
The current merged result is then compared with the next row in order, A:1-10-100-1. Since |8-10| = 2 ≤ 5 and |98-100| = 2 ≤ 5, the comparison result and the circular ribonucleic acid detected by A are of the same type, so they are merged. The comparison result has the larger count (1.5 versus 1), so its position is kept, and the count is (1.5+1)/2 = 1.25; the merged result is therefore B:1-8-98-1.25.
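The worked example above can be reproduced with a short Python sketch. It is one possible reading of the merging rule, not the patent's own script. The names `Circ`, `same_type`, `merge_pair` and `merge_sorted` are hypothetical, circRNA length is assumed to be end minus start, and the count is averaged pairwise at each merge, as in the example. That pairwise average is not generally the same as the mean over all rows of a type.

```python
from typing import List, NamedTuple


class Circ(NamedTuple):
    chrom: str
    start: int
    end: int
    count: float


def same_type(a: Circ, b: Circ, tol: int = 5) -> bool:
    """Same chromosome, and start and end positions each differ by at most tol."""
    return (a.chrom == b.chrom
            and abs(a.start - b.start) <= tol
            and abs(a.end - b.end) <= tol)


def merge_pair(a: Circ, b: Circ) -> Circ:
    """Keep the position of the row with the larger count; break ties by
    longer length, then by smaller start. The count is the pairwise average."""
    best = max(a, b, key=lambda r: (r.count, r.end - r.start, -r.start))
    return Circ(best.chrom, best.start, best.end, (a.count + b.count) / 2)


def merge_sorted(rows: List[Circ]) -> List[Circ]:
    """Walk the sorted rows, folding each row into the current merged result
    when the two are of the same type."""
    merged: List[Circ] = []
    for row in rows:
        if merged and same_type(merged[-1], row):
            merged[-1] = merge_pair(merged[-1], row)
        else:
            merged.append(row)
    return merged


# Rows A, B and C from the example; plain tuple sorting suffices here
# because all three lie on chromosome 1.
rows = sorted([Circ("1", 10, 100, 1), Circ("1", 8, 98, 2), Circ("1", 4, 94, 1)])
print(merge_sorted(rows))
```

Running the sketch prints a single record at position 1-8-98 with count 1.25, matching the result B:1-8-98-1.25 derived in the text.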
S5, analyzing and interpreting the sequence types and counts of the merged and de-duplicated circular ribonucleic acids to generate a chart report for the raw data.
In the embodiment of the invention, step S5 can be implemented with the analysis software R and corresponding analysis code.
In an embodiment of the invention, the chart report comprises position analysis information, length analysis information, count analysis information and type analysis information for the circRNA.
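The kind of summary that feeds such a report can be sketched as follows. The embodiment uses R for this step; Python is used here only for illustration. The function name `summarize`, the record layout `(chromosome, start, end, count)`, and the definition of length as end minus start are assumptions.

```python
from collections import Counter
from statistics import mean


def summarize(records):
    """Summarize merged (chromosome, start, end, count) records for a report."""
    lengths = [end - start for _, start, end, _ in records]
    return {
        # Position analysis: how many circRNA types fall on each chromosome.
        "per_chromosome": dict(Counter(chrom for chrom, *_ in records)),
        # Length analysis: mean span of the detected circRNAs.
        "mean_length": mean(lengths),
        # Count analysis: total of the merged counts.
        "total_count": sum(count for *_, count in records),
    }


# Hypothetical merged records used only for illustration.
records = [("1", 8, 98, 1.25), ("1", 200, 510, 3), ("2", 50, 250, 2)]
print(summarize(records))
```

A plotting layer would then turn these tables into the charts of the report.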
According to the embodiment of the invention, multiple circRNA analysis tools are integrated through the Nextflow framework, the results produced by these tools are compared, de-duplicated and screened, and the results of the different tools are combined into a final result, so that a more comprehensive and accurate circRNA prediction and analysis report can be obtained.
In addition, through the Nextflow framework, the embodiment of the invention automatically chains the gene analysis software of the different steps and automatically processes the results obtained at each step, which improves machine analysis efficiency, reduces manual involvement, and improves both the efficiency and the accuracy of the analysis.
In a preferred embodiment, the Nextflow-based automated analysis method for circular ribonucleic acid further comprises:
running an instruction to automatically configure the software environment according to preset configuration steps;
automatically detecting the hardware configuration of the current server, and automatically adjusting the software parameters according to that hardware configuration.
Configuring the software environment normally requires manual operation and complicated steps. In the embodiment of the invention, these steps are written into the instruction in advance, so the user only needs to run the instruction and the system (software) automatically performs the download, installation and configuration. The user can run the instruction by clicking a button.
Because some run-time parameters of the software depend on the hardware configuration of the server, these parameters need to be set to match the server.
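One way to derive a run-time parameter from the detected hardware is sketched below in Python. The function name `suggest_threads` and the policy of leaving one core free for the system are assumptions. The text does not say which parameters are adjusted or how.

```python
import os
from typing import Optional


def suggest_threads(cpu_count: Optional[int], reserve: int = 1) -> int:
    """Pick a thread count from the detected core count.

    Leaves `reserve` cores for the operating system and never returns less
    than 1. `cpu_count` may be None when the platform cannot report it.
    """
    return max(1, (cpu_count or 1) - reserve)


# os.cpu_count() reports the number of logical CPUs on the current machine.
threads = suggest_threads(os.cpu_count())
print(threads)
```

The resulting value would be passed to each tool's own thread option when the pipeline launches it.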
Preferably, the different tools of the same step can run in parallel, so that resources are used as fully as possible and analysis time is kept to a minimum.
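Nextflow schedules independent processes in parallel on its own. The Python sketch below only illustrates the idea of launching the tools of one step concurrently. The function `run_tool` is a hypothetical placeholder that stands in for invoking a real detection tool.

```python
from concurrent.futures import ThreadPoolExecutor

TOOLS = ["CIRCexplorer2", "CIRI", "MapSplice", "Segemehl", "find_circ"]


def run_tool(name: str) -> str:
    """Placeholder for launching one detection tool and waiting for it."""
    return f"{name}: done"


def run_all(tools, max_workers=None):
    """Run every tool of the step concurrently and collect the results.

    pool.map returns results in the order of the input list, regardless of
    which tool finishes first.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_tool, tools))


print(run_all(TOOLS))
```

Because the five tools do not depend on one another's output, the step takes roughly as long as its slowest tool rather than the sum of all five, provided the server has enough cores and memory.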
Example 2:
the embodiment of the invention also provides a Nextflow-based automated analysis system for circular ribonucleic acid, which comprises:
a quality control module 11, configured to perform quality control on the input raw gene data and the reference genome sequence, remove abnormal fragments whose quality score is lower than a first set value and whose GC content in the genome sequence is higher than a second set value, and generate a quality control report; wherein fastp and MultiQC software are used to realize the function of the quality control module;
an alignment module 12, configured to align the sequence fragments of the input sample to the reference genome sequence to confirm the specific position of each sequence on the genome; wherein STAR, BWA, Bowtie2 and Bowtie software are each used independently to realize the function of the alignment module;
a quantification module 13, configured to, after the sequence fragments of the input sample have been aligned to the reference genome sequence, confirm the sequence type and number of the circular ribonucleic acids, and annotate the name of each circular ribonucleic acid and its chromosomal position by means of an annotation file; wherein STAR-based CIRCexplorer2, BWA-based CIRI, Bowtie2-based MapSplice, segemehl and Bowtie2-based find_circ software are each used independently to realize the function of the quantification module;
a merging and de-duplication module 14, configured to merge and de-duplicate the sequence types and numbers of circular ribonucleic acids respectively obtained by STAR-based CIRCexplorer2, BWA-based CIRI, Bowtie2-based MapSplice, segemehl and Bowtie2-based find_circ software;
a report generation module 15, configured to analyze and interpret the sequence types and numbers of the merged and de-duplicated circular ribonucleic acids, and generate a chart report for the raw data;
wherein the quality control module 11, the alignment module 12, the quantification module 13, the merging and de-duplication module 14 and the report generation module 15 are all implemented within Nextflow.
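The filtering criterion of the quality control module can be sketched in Python as follows. This is not how fastp works internally; it only illustrates the two thresholds named above. The default values (mean Phred quality 20, GC fraction 0.8), the Phred+33 encoding and the choice to discard a read that fails either criterion are assumptions, and reads are assumed to be non-empty.

```python
def mean_quality(quality_string, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    return sum(ord(ch) - offset for ch in quality_string) / len(quality_string)

def gc_content(sequence):
    """Fraction of G and C bases in a sequence."""
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def passes_qc(sequence, quality_string, min_quality=20.0, max_gc=0.8):
    """Keep a read only if its quality is high enough and its GC content is not excessive.

    min_quality plays the role of the "first set value" and max_gc the role
    of the "second set value"; both defaults are illustrative.
    """
    return (mean_quality(quality_string) >= min_quality
            and gc_content(sequence) <= max_gc)

reads = [("ATGCATGC", "IIIIIIII"), ("GGGGCCCC", "IIIIIIII"), ("ATGCATGC", "!!!!!!!!")]
kept = [seq for seq, qual in reads if passes_qc(seq, qual)]
```

In this example only the first read is kept: the second exceeds the GC threshold and the third falls below the quality threshold.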
In a preferred embodiment, the sequence types and numbers of circular ribonucleic acids respectively obtained by STAR-based CIRCexplorer2, BWA-based CIRI, Bowtie2-based MapSplice, segemehl and Bowtie2-based find_circ software are merged and de-duplicated, specifically by:
merging the data of circular ribonucleic acids of the same type, and deleting the pre-merge data of that type; wherein the final number of a given type of circular ribonucleic acid is its merged number, which is the average of the numbers of all circular ribonucleic acids of that type.
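The merging and averaging can be sketched in Python as follows. The data layout is an assumption: each tool's result is a dictionary from a (chromosome, start, end) key to a count, and two records are treated as the same type only when their keys match exactly. The average is taken over the tools that reported the type, which is one possible reading of the text.

```python
from collections import defaultdict

def merge_counts(results_by_tool):
    """Merge per-tool circRNA counts; the final count is the mean over reporting tools."""
    collected = defaultdict(list)
    for tool_records in results_by_tool.values():
        for key, count in tool_records.items():
            collected[key].append(count)
    # One merged record per type replaces the per-tool records.
    return {key: sum(counts) / len(counts) for key, counts in collected.items()}

# Hypothetical counts from three of the quantification tools.
results_by_tool = {
    "CIRCexplorer2": {("chr1", 100, 500): 4, ("chr2", 50, 300): 2},
    "CIRI": {("chr1", 100, 500): 6},
    "find_circ": {("chr1", 100, 500): 8, ("chr3", 10, 90): 1},
}
merged = merge_counts(results_by_tool)
```

Here the circular RNA at chr1:100-500 is reported by three tools with counts 4, 6 and 8, so its final count is 6.0, while the types reported by a single tool keep that tool's count.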
In a preferred embodiment, the circular ribonucleic acids are arranged in a sorted list, and the circular ribonucleic acid in row N is compared with those in row N-1 and row N-2. If the detected circular ribonucleic acids are on the same chromosome, the base start position of the circular ribonucleic acid in row N-1 or row N-2 differs from the base start position of the circular ribonucleic acid in row N by less than or equal to 5, and the base end position of the circular ribonucleic acid in row N-1 or row N-2 differs from the base end position of the circular ribonucleic acid in row N by less than or equal to 5, then the circular ribonucleic acid in that row is of the same type as the circular ribonucleic acid in row N; wherein the sorted list is ordered by the position and number of each type of circular ribonucleic acid, and its columns are: chromosome, base start position, base end position, and number of circular ribonucleic acids of this type.
It should be noted that the system provided by this embodiment corresponds to the Nextflow-based automated analysis method for circular ribonucleic acid of Example 1, and is therefore not described in further detail here.
According to the embodiment of the invention, a plurality of pieces of circular ribonucleic acid analysis software are integrated through the Nextflow framework, the results produced by these pieces of software are compared, de-duplicated and screened, and the analysis results of the different pieces of software are combined to obtain the final result, so that a more comprehensive and accurate circRNA prediction and analysis report can be obtained.
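The comparison of each row with its two predecessors can be sketched in Python as follows. The record layout (chromosome, start, end, count), the function names, and the choice to report each group under the coordinates of its first row with the mean count are assumptions made for illustration, not details given in the text.

```python
def group_similar(records, tolerance=5):
    """Group sorted (chrom, start, end, count) rows that differ by at most `tolerance` bases."""
    rows = sorted(records)   # chromosome, then start, then end, then count
    groups = []              # each group is a list of rows of the same type
    group_of_row = []        # index of the group each sorted row belongs to
    for n, (chrom, start, end, _count) in enumerate(rows):
        assigned = None
        for prev in (n - 1, n - 2):   # compare with the two preceding rows
            if prev < 0:
                continue
            p_chrom, p_start, p_end, _ = rows[prev]
            if (p_chrom == chrom
                    and abs(start - p_start) <= tolerance
                    and abs(end - p_end) <= tolerance):
                assigned = group_of_row[prev]
                break
        if assigned is None:          # no close neighbour: start a new type
            groups.append([])
            assigned = len(groups) - 1
        groups[assigned].append(rows[n])
        group_of_row.append(assigned)
    return groups

def summarise(groups):
    """One row per type: coordinates of the first row and the mean count."""
    return [(g[0][0], g[0][1], g[0][2], sum(r[3] for r in g) / len(g)) for g in groups]

records = [("chr1", 100, 500, 4), ("chr1", 103, 502, 6),
           ("chr1", 100, 900, 3), ("chr2", 100, 500, 2)]
deduplicated = summarise(group_similar(records))
```

In this example the row chr1:100-900 sorts between chr1:100-500 and chr1:103-502, so the last two are only recognised as the same type because row N-2 is also checked; comparing adjacent rows alone would miss the match.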
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (9)

1. A Nextflow-based automated analysis method for circular ribonucleic acid, characterized by comprising the following steps:
S1, performing quality control on the input raw gene data and the reference genome sequence, removing abnormal fragments whose quality score is lower than a first set value and whose GC content in the genome sequence is higher than a second set value, and generating a quality control report; wherein fastp and MultiQC software are used to implement step S1;
S2, aligning the sequence fragments of the input sample to the reference genome sequence to confirm the specific position of each sequence on the genome; wherein STAR, BWA, Bowtie2 and Bowtie software are each used independently to implement step S2;
S3, after the sequence fragments of the input sample have been aligned to the reference genome sequence, confirming the sequence type and number of the circular ribonucleic acids, and annotating the name of each circular ribonucleic acid and its chromosomal position by means of an annotation file; wherein STAR-based CIRCexplorer2, BWA-based CIRI, Bowtie2-based MapSplice, segemehl and Bowtie2-based find_circ software are each used independently to implement step S3;
S4, merging and de-duplicating the sequence types and numbers of circular ribonucleic acids respectively obtained by STAR-based CIRCexplorer2, BWA-based CIRI, Bowtie2-based MapSplice, segemehl and Bowtie2-based find_circ software;
S5, analyzing and interpreting the sequence types and numbers of the merged and de-duplicated circular ribonucleic acids to generate a chart report for the raw data;
wherein steps S1-S5 are all carried out within Nextflow;
if the detected circular ribonucleic acids are on the same chromosome, and, in the sorted list, the base start position of the circular ribonucleic acid in row N-1 or row N-2 differs from the base start position of the circular ribonucleic acid in row N by less than or equal to 5, and the base end position of the circular ribonucleic acid in row N-1 or row N-2 differs from the base end position of the circular ribonucleic acid in row N by less than or equal to 5, then the circular ribonucleic acid in that row is of the same type as the circular ribonucleic acid in row N.
2. The Nextflow-based automated analysis method for circular ribonucleic acids according to claim 1, further comprising, before step S1:
S0, establishing an alignment index file for the input genome sequence; wherein Bowtie, Bowtie2 and STAR software are each used independently to implement step S0.
3. The Nextflow-based automated analysis method for circular ribonucleic acid according to claim 1 or 2, characterized in that the sequence types and numbers of circular ribonucleic acids respectively obtained by STAR-based CIRCexplorer2, BWA-based CIRI, Bowtie2-based MapSplice, segemehl and Bowtie2-based find_circ software are merged and de-duplicated, specifically by:
merging the data of circular ribonucleic acids of the same type, and deleting the pre-merge data of that type; wherein the final number of a given type of circular ribonucleic acid is its merged number, which is the average of the numbers of all circular ribonucleic acids of that type.
4. The Nextflow-based automated analysis method for circular ribonucleic acid according to claim 3, characterized in that the sorted list is ordered by the position and number of each type of circular ribonucleic acid, and its columns are: chromosome, base start position, base end position, and number of circular ribonucleic acids of this type.
5. The Nextflow-based automated analysis method for circular ribonucleic acid according to claim 4, wherein the chart report includes position analysis information, length analysis information, number analysis information and type analysis information for the circular ribonucleic acids.
6. The Nextflow-based automated analysis method for circular ribonucleic acids according to claim 1, further comprising:
running an instruction to automatically execute the operation of configuring the software environment according to the preset configuration steps;
automatically capturing the hardware configuration of the current server, and modifying the parameters of the software according to the hardware configuration of the server.
7. An automated Nextflow-based circular ribonucleic acid analysis system, comprising:
the quality control module is used for performing quality control on the input raw gene data and the reference genome sequence, removing abnormal fragments whose quality score is lower than a first set value and whose GC content in the genome sequence is higher than a second set value, and generating a quality control report; wherein fastp and MultiQC software are used to realize the function of the quality control module;
the alignment module is used for aligning the sequence fragments of the input sample to the reference genome sequence so as to confirm the specific position of each sequence on the genome; wherein STAR, BWA, Bowtie2 and Bowtie software are each used independently to realize the function of the alignment module;
the quantification module is used for confirming the sequence type and number of the circular ribonucleic acids after the sequence fragments of the input sample have been aligned to the reference genome sequence, and annotating the name of each circular ribonucleic acid and its chromosomal position by means of an annotation file; wherein STAR-based CIRCexplorer2, BWA-based CIRI, Bowtie2-based MapSplice, segemehl and Bowtie2-based find_circ software are each used independently to realize the function of the quantification module;
the merging and de-duplication module is used for merging and de-duplicating the sequence types and numbers of circular ribonucleic acids respectively obtained by STAR-based CIRCexplorer2, BWA-based CIRI, Bowtie2-based MapSplice, segemehl and Bowtie2-based find_circ software;
the report generation module is used for analyzing and interpreting the sequence types and numbers of the merged and de-duplicated circular ribonucleic acids to generate a chart report for the raw data;
wherein the quality control module, the alignment module, the quantification module, the merging and de-duplication module and the report generation module are all implemented within Nextflow;
if the detected circular ribonucleic acids are on the same chromosome, and, in the sorted list, the base start position of the circular ribonucleic acid in row N-1 or row N-2 differs from the base start position of the circular ribonucleic acid in row N by less than or equal to 5, and the base end position of the circular ribonucleic acid in row N-1 or row N-2 differs from the base end position of the circular ribonucleic acid in row N by less than or equal to 5, then the circular ribonucleic acid in that row is of the same type as the circular ribonucleic acid in row N.
8. The Nextflow-based automated analysis system for circular ribonucleic acid according to claim 7, wherein the sequence types and numbers of circular ribonucleic acids respectively obtained by STAR-based CIRCexplorer2, BWA-based CIRI, Bowtie2-based MapSplice, segemehl and Bowtie2-based find_circ software are merged and de-duplicated, specifically by:
merging the data of circular ribonucleic acids of the same type, and deleting the pre-merge data of that type; wherein the final number of a given type of circular ribonucleic acid is its merged number, which is the average of the numbers of all circular ribonucleic acids of that type.
9. The Nextflow-based automated analysis system for circular ribonucleic acid according to claim 8, characterized in that the sorted list is ordered by the position and number of each type of circular ribonucleic acid, and its columns are: chromosome, base start position, base end position, and number of circular ribonucleic acids of this type.
CN202010024079.7A 2020-01-08 2020-01-08 Nextflow-based automatic analysis method and system for circular ribonucleic acid Active CN111243666B (en)

Publications (2)

Publication Number Publication Date
CN111243666A CN111243666A (en) 2020-06-05
CN111243666B true CN111243666B (en) 2023-04-07

Family

ID=70866226


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant