CN110600080B - Comprehensive functional nucleic acid identification method based on multi-dimensional analysis framework and application thereof - Google Patents

Comprehensive functional nucleic acid identification method based on multi-dimensional analysis framework and application thereof

Publication number
CN110600080B
Authority
CN
China
Prior art keywords
family
nucleic acid
aptamer
score
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910850896.5A
Other languages
Chinese (zh)
Other versions
CN110600080A (en)
Inventor
杨朝勇
宋彦龄
宋佳
郑媛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Renji Hospital Shanghai Jiaotong University School of Medicine
Original Assignee
Renji Hospital Shanghai Jiaotong University School of Medicine
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Renji Hospital Shanghai Jiaotong University School of Medicine filed Critical Renji Hospital Shanghai Jiaotong University School of Medicine
Priority to CN201910850896.5A priority Critical patent/CN110600080B/en
Publication of CN110600080A publication Critical patent/CN110600080A/en
Application granted granted Critical
Publication of CN110600080B publication Critical patent/CN110600080B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • C: CHEMISTRY; METALLURGY
    • C40: COMBINATORIAL TECHNOLOGY
    • C40B: COMBINATORIAL CHEMISTRY; LIBRARIES, e.g. CHEMICAL LIBRARIES
    • C40B50/00: Methods of creating libraries, e.g. combinatorial synthesis
    • C40B50/06: Biochemical methods, e.g. using enzymes or whole viable microorganisms
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biochemistry (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Organic Chemistry (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Bioethics (AREA)
  • Microbiology (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • General Chemical & Material Sciences (AREA)
  • Medicinal Chemistry (AREA)
  • Molecular Biology (AREA)
  • Analytical Chemistry (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a comprehensive functional nucleic acid identification method based on a multi-dimensional analysis framework, which comprises the following step: screening functional nucleic acids with a multi-dimensional analysis framework based on nucleic acid library sequencing data. The multi-dimensional analysis framework comprises: the structural domain/pattern sequence content (Kscore), the enrichment degree of the nucleic acid family (Fscore) and the structural stability (Sscore). The invention provides a multi-dimensional analysis framework for the first time; based on multi-level structural analysis and sequence analysis, it uses technical means such as unsupervised learning, pattern sequence search and structural simulation to evaluate aptamer sequences along several dimensions, so that aptamers with target binding capacity are identified. The algorithm can process high-throughput data efficiently and identify high-performance aptamers comprehensively.

Description

Multi-dimensional analysis framework-based functional nucleic acid comprehensive identification method and application thereof
Technical Field
The invention belongs to the field of high-throughput data processing, and particularly relates to a comprehensive functional nucleic acid identification method based on a multi-dimensional analysis framework and application thereof.
Background
Computer-assisted aptamer screening based on analytical means such as data mining and machine learning is an emerging approach. It falls into two main categories: aptamer design based on computer simulation, deep learning and the like, and aptamer identification based on mining second-generation sequencing data of libraries. The former mostly relies on molecular docking modeling, but has made little progress because few protein structures, nucleic acid structures and protein-nucleic acid complex structures have been resolved. For the latter, a growing body of work since 2010 has shown that second-generation sequencing data of libraries can assist Systematic Evolution of Ligands by Exponential Enrichment (SELEX): it helps to improve the screening success rate, reduce the number of screening rounds, obtain more and better aptamers, and understand the in vitro SELEX process.
Second-generation high-throughput sequencing of libraries brings new opportunities for aptamer screening, but library data are highly complex and voluminous, the evolution mechanism is complicated, and little mechanistic research is available to build on, so analysis algorithms for second-generation sequencing data of nucleic acid libraries have developed slowly. In addition, existing algorithms are limited by low sensitivity, low accuracy and long computation times, and have not been widely applied to library sequence analysis. Therefore, developing a sensitive, accurate and efficient analysis platform is central to making better use of second-generation sequencing data to assist current SELEX systems.
Firstly, effective data filtering is a prerequisite for efficient processing of high-throughput library data, but existing data filtering methods are simplistic and inefficient. Filtering is mainly performed by random sampling or by simply setting a frequency threshold for aptamers, which easily produces many false negatives, in particular the loss of low-frequency, high-performance aptamers. Developing a more suitable filtering approach is therefore one of the pressing problems in the field.
Secondly, an accurate and efficient sequence family classification method is essential for clarifying library composition, but existing methods suffer from a narrow range of application, low efficiency and similar problems. Early family analysis was based on statistics of consensus sequences, but library composition is complex, and simple sequence statistics often fail to summarize it faithfully. Later algorithms therefore mostly classify sequence families by sequence similarity (e.g., AptaCluster), and how to measure the similarity between sequences becomes the core of these algorithms. The currently available library sequence similarity measures are mainly based on the edit distance (Levenshtein distance, LD) and Locality-Sensitive Hashing (LSH). These measures either fail to account for base insertions and deletions or require sequences of equal length, so their applicability is narrow. Moreover, these algorithms are slow: their computation time grows roughly quadratically with the number of sequences measured, so they cannot analyze second-generation sequencing data of nucleic acid libraries efficiently and have not been widely adopted. There is therefore an urgent need for more efficient and more widely applicable methods for measuring nucleic acid sequence similarity and for classification.
Finally, the library evolution mechanism is complex, and each aptamer has different evolution characteristics. Low-frequency, high-performance aptamers cannot be identified through sequence family analysis, and their omission is currently a main source of false negatives for algorithms in this field. On the other hand, many sequences that have high frequencies or form large families may be introduced by non-specific adsorption and amplification bias, a phenomenon that often leads to false positives. Therefore, unlike the algorithms based on family classification, many algorithms in recent years have centered on the analysis of sequence secondary structure. However, existing secondary structure prediction software is not suitable for high-throughput data, so many works take secondary substructures as the object of study. These algorithms start from the "k-mer" (a base sequence of length k): some assume that a significantly enriched "k-mer" is a secondary substructure with binding capacity, while others combine prediction of secondary substructures (classical nucleic acid secondary substructures include pseudoknots, stem-loops, bulges, hairpins and the like) with a search for significantly enriched "k-mers". Compared with sequence family classification, algorithms based on significant "k-mer" enrichment can process larger amounts of data, whereas algorithms that integrate substructure prediction remain slow because of the additional secondary substructure prediction required. In conclusion, these methods either consider only substructures and are therefore not very accurate, or cannot be applied to second-generation high-throughput sequencing data.
How to make effective use of secondary structure information while still handling high-throughput data is the key to improving the accuracy of existing aptamer identification algorithms.
Disclosure of Invention
The invention aims to provide a comprehensive nucleic acid identification method based on a multidimensional analysis framework, which solves the following problems:
1) The second generation sequencing data of the nucleic acid library has the problems of very large data volume and high computational complexity;
2) Similarity between nucleic acids is difficult to measure, and family classification is inefficient;
3) Owing to amplification bias and non-specific adsorption, high-frequency, low-performance aptamers are easily introduced by mistake;
4) Owing to the randomness of sequencing and the complex evolution of a nucleic acid library, low-frequency, high-performance aptamers are easily lost.
In order to achieve the above object, the present invention provides a method for comprehensively identifying nucleic acids based on a multidimensional analysis framework, comprising the steps of:
step 1: filtering the nucleic acid library sequencing data based on pattern sequence search;
step 2: efficiently classifying the nucleic acid library sequencing data filtered in step 1 by using unsupervised learning;
step 3: based on the library sequence family classification result of step 2, evaluating the secondary substructure/pattern sequence content of the aptamers (Kscore);
step 4: based on the library sequence family classification result of step 2, evaluating the enrichment degree of the aptamer families according to family size (Fscore);
step 5: based on prediction of the minimum free energy of the secondary structure and of G-quadruplex structure, evaluating the secondary structure stability of the aptamers (Sscore) for the library sequence family classification result of step 2;
step 6: using the multidimensional analysis framework, comprehensively evaluating and weighing the functional nucleic acids from three aspects, namely secondary substructure/pattern sequence content (Kscore), aptamer family enrichment degree (Fscore) and secondary structure stability (Sscore), and identifying the high-performance functional nucleic acids.
Preferably, the nucleic acid library sequencing data in step 1 is second generation library sequencing data.
Preferably, the step 1 specifically includes: screening the nucleic acid library sequencing data based on the frequency distribution and the amplification-fold distribution of k-mers, where a k-mer is a contiguous base segment of length k, to obtain a set of k-mers that are enriched or tend to be enriched; then designing a scoring formula that scores the k-mers in this set by balancing frequency and amplification information, yielding Score_k-mer; and, based on Score_k-mer, designing the Filter Score_aptamer formula and filtering out sequences that do not contain a pattern sequence according to a user-defined threshold.
More preferably, the scoring formula is as follows:
[Formula image not reproduced in the text: k-mer scoring formula, Figure BDA0002196876890000031]
where k-mer(i) is the i-th k-mer, i = 1, 2, 3, …, n.
More preferably, the Filter Score_aptamer formula is as follows:
$$\mathrm{FilterScore}_{\mathrm{aptamer}(i)} = \max\{\mathrm{score}_{\text{k-mer}(j)} \mid \text{k-mer}(j) \in S_{\mathrm{aptamer}(i)}\}$$
where S_aptamer(i) is the set of all k-mers contained in aptamer(i), and aptamer(i) is the i-th aptamer, i = 1, 2, 3, …, n.
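The filter score can be illustrated with a minimal Python sketch. It assumes the k-mer scores have already been computed and are passed in as a dictionary; k-mers absent from the dictionary are treated as having score 0, and a sequence is kept when its filter score is at least the threshold. The function names and the choice of "at least" rather than "strictly greater than" are assumptions made for illustration, not details given in the text.

```python
def kmers_of(seq, k):
    """Return the set of distinct contiguous k-long segments of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def filter_score(seq, kmer_scores, k):
    """Filter score of one sequence: the maximum score among its k-mers.

    k-mers missing from kmer_scores count as 0 (assumption consistent with
    assigning 0 to k-mers outside the enriched sets).
    """
    return max((kmer_scores.get(kmer, 0.0) for kmer in kmers_of(seq, k)),
               default=0.0)


def filter_library(sequences, kmer_scores, k, threshold):
    """Keep sequences whose filter score reaches the user-defined threshold."""
    return [s for s in sequences
            if filter_score(s, kmer_scores, k) >= threshold]
```

Because each sequence is scanned once and every k-mer lookup is a dictionary access, the cost of this filter grows linearly with the total number of bases, which fits the stated aim of filtering very large libraries quickly.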
Preferably, the step 2 specifically includes: aligning all aptamer sequences retained after the filtering of step 1 pairwise, scoring each alignment with BLASTshort, and constructing a nucleic acid correlation graph from the alignment scores, where the edge weight Weight_edge(ab) is the normalized alignment score; family classification is then performed with the Markov clustering algorithm (MCL).
More preferably, the calculation formula of the weight is as follows:
[Formula image not reproduced in the text: edge weight formula, Figure BDA0002196876890000041]
where the bit score is the alignment score output by the BLAST alignment algorithm, and a and b denote any two vertices.
Preferably, the step 3 specifically includes: based on the family classification result of step 2, selecting the highest-frequency nucleic acid in each family as the representative sequence, and calculating the Kscore of the representative sequence from the k-mer scores obtained in step 1; this is the Kscore of the nucleic acid family.
More preferably, the calculation formula of the nucleic acid family Kscore is as follows:
$$\mathrm{Kscore}_{\mathrm{aptamer}(i)} = \sum_{\text{k-mer}(j) \in S_{\mathrm{aptamer}(i)}} \mathrm{score}_{\text{k-mer}(j)}$$
where Kscore_aptamer is the overall degree of enrichment of the secondary substructures/pattern sequences of the representative sequence of each nucleic acid family; the Kscore_aptamer of the representative sequence is then used as the Kscore of that nucleic acid family.
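As a minimal sketch of this step, the Python below picks the highest-frequency member of a family as its representative and sums the scores of the k-mers that the representative contains. It assumes each family is given as a dictionary from sequence to read count, that S_aptamer is the set of distinct k-mers (so a repeated k-mer is counted once), and that unscored k-mers contribute 0; the function names are hypothetical.

```python
def representative(family_counts):
    """Highest-frequency sequence of a family (ties broken alphabetically)."""
    return max(sorted(family_counts), key=family_counts.get)


def kscore(seq, kmer_scores, k):
    """Sum of the scores of the distinct k-mers contained in seq."""
    distinct = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return sum(kmer_scores.get(kmer, 0.0) for kmer in distinct)


def family_kscore(family_counts, kmer_scores, k):
    """Kscore of a family, taken as the Kscore of its representative."""
    return kscore(representative(family_counts), kmer_scores, k)
```

Scoring only the representative keeps the cost proportional to the number of families rather than the number of reads.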
Preferably, in the step 4, the calculation formula of the enrichment degree Fscore of the nucleic acid family is as follows:
[Formula image not reproduced in the text: Fscore formula, Figure BDA0002196876890000042]
wherein, family (i) represents the ith family, i =1,2,3 … n; fsize is the family size and mean is the average family size.
Preferably, the step 5 specifically includes: calculating the minimum free energy (dG) of the aptamer secondary structure with mfold or RNAfold, calculating the likelihood (GS) that the aptamer forms a G-quadruplex with QGRS, and obtaining the Sscore by weighing the minimum free energy against the GS score.
More preferably, the calculation formula of the Sscore is as follows:
[Formula image not reproduced in the text: Sscore formula, Figure BDA0002196876890000043]
where family(i) denotes the i-th family, i = 1, 2, 3, …, n; r_aptamer denotes the highest-frequency sequence of each nucleic acid aptamer family.
Preferably, the step 6 specifically includes: selecting the two largest of the Kscore, the Fscore and the Sscore and averaging them to obtain the final MDA-score for evaluating the performance of the functional nucleic acid, and finally selecting candidate nucleic acid aptamer sequences based on the MDA-score, where a higher MDA-score is taken to indicate a higher likelihood of target binding.
More preferably, the MDA-score is calculated as follows:
$$\text{MDA-score} = \frac{\mathrm{Kscore} + \mathrm{Fscore} + \mathrm{Sscore} - \mathrm{min\_score}}{2}$$
where min_score represents the minimum of the three scores.
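The rule of averaging the two largest of the three scores can be written in a few lines of Python. The sketch below assumes the three scores of a family are already on comparable scales and are supplied as plain numbers; the function names and the ranking helper are hypothetical additions for illustration.

```python
def mda_score(kscore, fscore, sscore):
    """Average of the two largest of the three scores.

    Subtracting the minimum from the sum and halving is the same as
    averaging the two larger values.
    """
    scores = (kscore, fscore, sscore)
    return (sum(scores) - min(scores)) / 2


def rank_families(family_scores):
    """Sort family names by descending MDA-score.

    family_scores maps a family name to a (Kscore, Fscore, Sscore) tuple.
    """
    return sorted(family_scores,
                  key=lambda name: mda_score(*family_scores[name]),
                  reverse=True)
```

Dropping the weakest dimension means a family is not penalized for one low score, for example an aptamer that binds through a substructure but has a weakly stable overall fold, while a family that is strong in only one dimension does not rank highly.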
Preferably, in the step 6, the high-performance functional nucleic acid is an aptamer.
The comprehensive nucleic acid identification method based on the multidimensional analysis framework is applied to screening of nucleic acid aptamers.
The invention has the advantages that:
1) The invention provides a multi-dimensional analysis framework for the first time, and based on multi-level structural analysis and sequence analysis, technical means such as unsupervised learning, pattern sequence search, secondary structure simulation and the like are utilized to carry out multi-dimensional evaluation on aptamer sequences, so that aptamers with target binding capacity are identified.
2) By integrating pattern sequence search and scoring, the invention achieves efficient filtering of second-generation high-throughput data for the first time; it can process hundreds of millions of sequences in a few minutes, handles high-throughput data efficiently, and improves the overall analysis efficiency.
3) The invention reduces the loss of the low-frequency high-performance aptamer and the error introduction of the high-frequency low-performance aptamer by utilizing multi-dimensional evaluation; and multi-level structure information and multi-level sequence information are considered at the same time, so that the accuracy of the identification result is high.
4) The multi-dimensional scoring formula provided by the invention scores each nucleic acid sequence family by taking the most advantageous two dimensions from the three-dimensional information, so that aptamers with different performances can be captured, for example, aptamers combined through certain substructures and aptamers combined through an integral structure.
Drawings
FIG. 1 is a schematic diagram of a scoring process based on pattern sequence search;
FIG. 2 is a schematic diagram of the 'BLAST-short-MCL' strategy;
FIG. 3 is a schematic diagram of an overall multi-dimensional identification framework of an algorithm;
FIG. 4 shows the flow cytometry results for the binding of the aptamers EpCAM S1-S11 to the EpCAM protein, with the individual aptamers on the abscissa and the median fluorescence intensity on the ordinate.
FIGS. 5-15 show the dissociation constants, determined by flow cytometry, of the aptamers EpCAM S1-S11 for the epithelial cell adhesion molecule (EpCAM) protein, with DNA concentration (nmol/L) on the abscissa and mean fluorescence intensity on the ordinate.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
The invention takes an epithelial cell adhesion molecule EpCAM nucleic acid library as an example to identify an EpCAM aptamer, and specifically comprises the following steps:
step 1: screening of a library of aptamers that specifically bind to the epithelial cell adhesion molecule EpCAM:
step a) dissolving 5 nmol of the synthesized single-stranded DNA nucleic acid library in binding buffer (12 mmol/L PBS, 0.55 mmol/L MgCl2) and heat-treating it: heating at 95 ℃ for 5 min, placing on ice for 10 min, and then standing at room temperature for 10 min;
step b) incubating the treated single-stranded DNA nucleic acid library with Ni microbeads, and collecting the liquid not bound to the Ni microbeads;
step c) incubating the liquid not bound to the Ni microbeads with EpCAM Ni microbeads for 40 min at 37 ℃;
step d) washing the incubated EpCAM Ni microbeads with binding buffer, and performing PCR on the EpCAM Ni microbeads carrying bound oligonucleotides; the PCR program is: pre-denaturation at 94 ℃ for 3 min; 10 cycles of 94 ℃ for 30 s, 53 ℃ for 30 s and 68 ℃ for 30 s; and final extension at 68 ℃ for 5 min; primer 1: 5'-FAM-AGC GTC GAA TAC CAC TAC AG-3'; primer 2: 5'-Biotin-CTG ACC ACG AGC TCC ATT AG-3';
step e) after the PCR is finished, the product is double-stranded DNA bearing a biotin label at the 3' end and a FAM label at the 5' end; streptavidin microbeads are added and reacted for 30 min, the strands are separated with 0.1 mol/L NaOH, and the product is purified on a desalting column to obtain the single-stranded DNA library for the next round of screening;
step f) using 200 pmol of single-stranded DNA library in each round and gradually increasing the number of washes to raise the screening stringency, performing 12 rounds of screening; the enrichment of the single-stranded DNA library is monitored with a flow cytometer, and the libraries of rounds 2, 3, 4, 6, 7 and 8 are finally submitted for high-throughput sequencing; the round-12 library binds clearly to the target protein EpCAM (see figure 1) and does not bind to the Ni protein (see figure 2);
step 2: nucleic acid library sequencing data filtering based on pattern sequence search:
as shown in FIG. 1, the frequency distribution of "k-mers" is used to screen a set of "k-mers" that are enriched or tend to be enriched (i.e., the high-frequency "k-mers" appearing in the library, defined as "set1"), where the high-frequency "k-mers" are those whose frequency in each round of the library is higher than the 95th percentile of the predefined frequency distribution of the control library; the distribution of the amplification folds of the "k-mers" is plotted, and a "set2" of the same size as "set1" is selected, consisting of the "k-mers" with the largest amplification folds; a scoring function (formula 1) is designed to score the k-mers in the two sets by balancing frequency and amplification information, and all remaining k-mers are assigned a score of 0; the score of a "k-mer" represents the likelihood that it is a pattern sequence, a higher score indicating a higher likelihood;
[Formula (1) image not reproduced in the text: k-mer scoring function, Figure BDA0002196876890000071]
where k-mer(i) is the i-th k-mer, i = 1, 2, 3, …, n;
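The selection of "set1" and "set2" can be sketched in Python as follows. This is a simplified illustration under stated assumptions: only one earlier and one later round are compared (the text requires the frequency criterion to hold in each round), the amplification fold is taken as the ratio of relative frequencies between the two rounds, k-mers unseen in the earlier round receive the smallest observed earlier frequency as a pseudo-frequency, and the percentile uses the nearest-rank convention. The scoring function of formula 1 is not reproduced here because it is not recoverable from the text. All function names are hypothetical.

```python
from collections import Counter


def kmer_frequencies(sequences, k):
    """Relative frequency of every contiguous k-long segment in a library."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def percentile(values, q):
    """Nearest-rank percentile (integer q in 1..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, -(-len(ordered) * q // 100))  # ceil(len * q / 100)
    return ordered[int(rank) - 1]


def select_kmer_sets(control, earlier_round, later_round, k, q=95):
    """Return (set1, set2, fold) for one pair of rounds.

    set1: k-mers whose later-round frequency exceeds the q-th percentile of
          the control library's k-mer frequency distribution.
    set2: the len(set1) k-mers with the largest amplification fold.
    """
    ctrl = kmer_frequencies(control, k)
    early = kmer_frequencies(earlier_round, k)
    late = kmer_frequencies(later_round, k)

    cutoff = percentile(list(ctrl.values()), q)
    set1 = {kmer for kmer, f in late.items() if f > cutoff}

    # Pseudo-frequency for k-mers unseen in the earlier round (assumption).
    pseudo = min(early.values()) if early else 1.0
    fold = {kmer: f / early.get(kmer, pseudo) for kmer, f in late.items()}
    ranked = sorted(fold, key=lambda kmer: (-fold[kmer], kmer))
    set2 = set(ranked[:len(set1)])
    return set1, set2, fold
```

Using two sets lets a k-mer qualify either by being abundant or by growing quickly between rounds, which is one way to retain low-frequency sequences that are nevertheless being enriched.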
each nucleic acid sequence in the sequencing data is scored based on the "k-mer" scores ("Filter Score_aptamer", formula 2), and sequences without pattern sequences are filtered out according to a user-defined threshold (set to 10 in the present experiment); in this way 50%-90% of second-generation sequencing data can be filtered out (the actual filtering proportion in the present experiment is 92.66%);
$$\mathrm{FilterScore}_{\mathrm{aptamer}(i)} = \max\{\mathrm{score}_{\text{k-mer}(j)} \mid \text{k-mer}(j) \in S_{\mathrm{aptamer}(i)}\} \quad (2)$$
where S_aptamer(i) is the set of all k-mers contained in aptamer(i), and aptamer(i) is the i-th aptamer, i = 1, 2, 3, …, n;
step 3: using the "BLAST-short-MCL" strategy, library sequence family classification is carried out, and the enrichment degree Fscore of the aptamer families is evaluated:
as shown in fig. 2, all aptamer sequences are aligned pairwise with the "BLASTshort" program, and an aptamer correlation graph is constructed from the alignment scores, where the edge weight is the normalized alignment score (formula 3); family classification is performed with Markov clustering, and the filtered sequencing data are divided into more than 20,000 different nucleic acid families;
[Formula (3) image not reproduced in the text: edge weight formula, Figure BDA0002196876890000081]
where a and b are vertices of the graph, each representing a nucleic acid aptamer;
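To illustrate the clustering stage, the following Python sketch runs a basic Markov clustering loop (alternating expansion and inflation) on a small weighted graph. It is a teaching sketch rather than the MCL program itself: the edge weights are assumed to be normalized alignment scores that have already been computed (no BLAST alignment is run and the normalization of formula 3 is not reproduced), the inflation value, iteration count and self-loop weight are illustrative defaults, and each vertex is assigned to the row holding the largest value in its column.

```python
def markov_cluster(n, weights, inflation=2.0, iterations=30, self_loop=1.0):
    """Cluster vertices 0..n-1 of an undirected weighted graph.

    weights maps an (a, b) vertex pair to a non-negative edge weight.
    Returns a sorted list of clusters, each a sorted list of vertex indices.
    """
    m = [[0.0] * n for _ in range(n)]
    for (a, b), w in weights.items():
        m[a][b] = m[b][a] = w
    for i in range(n):
        m[i][i] = self_loop  # self-loops, a common MCL convention

    def normalize_columns(mat):
        # Make every non-empty column sum to 1 (column-stochastic matrix).
        for j in range(n):
            total = sum(mat[i][j] for i in range(n))
            if total > 0:
                for i in range(n):
                    mat[i][j] /= total
        return mat

    m = normalize_columns(m)
    for _ in range(iterations):
        # Expansion: square the matrix (two-step random walks).
        m = [[sum(m[i][x] * m[x][j] for x in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: raise entries to a power, then renormalize columns.
        m = normalize_columns([[v ** inflation for v in row] for row in m])

    # Assign each vertex (column) to the row with the largest value.
    families = {}
    for j in range(n):
        attractor = max(range(n), key=lambda i: m[i][j])
        families.setdefault(attractor, []).append(j)
    return sorted(families.values())
```

This dense version costs on the order of n³ operations per iteration and is only meant for small examples; clustering tens of thousands of sequences would need a sparse implementation.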
step 4: evaluating the secondary substructure/pattern sequence content (Kscore) of the aptamers for the library sequence families classified in step 3:
based on the family classification result of step 3, the highest-frequency nucleic acid in each family is selected as the representative sequence, and the Kscore of the representative sequence (formula 4) is calculated from the k-mer scores obtained in step 2; this is the Kscore of the nucleic acid family;
$$\mathrm{Kscore}_{\mathrm{aptamer}(i)} = \sum_{\text{k-mer}(j) \in S_{\mathrm{aptamer}(i)}} \mathrm{score}_{\text{k-mer}(j)} \quad (4)$$
where Kscore_aptamer is the overall degree of enrichment of the secondary substructures/pattern sequences of the representative sequence of each nucleic acid family; the Kscore_aptamer of the representative sequence is then used as the Kscore of that nucleic acid family;
step 5: evaluating the aptamer family enrichment degree (Fscore) of the library sequence families classified in step 3:
the Fscore (formula 5) is calculated from the family size based on the family classification result of step 3;
[Formula (5) image not reproduced in the text: Fscore formula, Figure BDA0002196876890000082]
where family(i) denotes the i-th family, i = 1, 2, 3, …, n; Fsize is the family size and mean is the average family size;
step 6: based on prediction of the minimum free energy and of G-quadruplex structure, evaluating the secondary structure stability (Sscore) of the aptamers:
the minimum free energy (dG) of the aptamer secondary structure is calculated with "mfold" or "RNAfold", the likelihood (GS) that the aptamer forms a G-quadruplex is calculated with "QGRS", and the minimum free energy and the "GS" score are weighed against each other to obtain the "Sscore" (formula 6); for each aptamer family, the highest-frequency sequence is selected as the representative sequence (r_aptamer);
[Formula (6) image not reproduced in the text: Sscore formula, Figure BDA0002196876890000091]
where family(i) denotes the i-th family, i = 1, 2, 3, …, n; r_aptamer denotes the highest-frequency sequence of each aptamer family;
step 7: using the multidimensional analysis framework, comprehensively evaluating and weighing the nucleic acid families from three aspects, namely secondary substructure/pattern sequence content (Kscore), enrichment degree of the nucleic acid aptamer family (Fscore) and secondary structure stability (Sscore), and identifying the nucleic acid aptamers with target binding capacity:
as shown in fig. 3, from the calculated "Kscore", "Fscore" and "Sscore", the two largest scores are selected and averaged to obtain the "MDA-score" (formula 7). Candidate nucleic acid aptamer sequences are finally selected based on the "MDA-score", where a higher "MDA-score" is taken to indicate a higher likelihood of target binding. On the one hand, this selection reduces the false positives introduced by any single measure (for example, some aptamers without target binding ability may have a very large sequence family because of amplification bias); on the other hand, it reduces the loss of aptamers with different characteristics (for example, some high-performance aptamers exert their binding capacity through a particular substructure while the stability of the overall secondary structure is not strong). With this trade-off rule, aptamers with different characteristics are retained and false positives caused by amplification bias, non-specific adsorption and the like are eliminated.
$$\text{MDA-score} = \frac{\mathrm{Kscore} + \mathrm{Fscore} + \mathrm{Sscore} - \mathrm{min\_score}}{2} \quad (7)$$
where "min_score" represents the minimum of the three scores;
the finally obtained 11 aptamers with the highest score are DNA fragments shown by any sequence of SEQ ID NO 1-SEQ ID NO 11 and are respectively named as EpCAM S1-S11;
step 8: verification of the binding capacity of the EpCAM aptamers: the nucleic acid aptamer libraries from screening rounds 2, 3, 4, 6, 7 and 8 are selected for second-generation high-throughput sequencing, candidate aptamers are identified from the sequencing data using the above steps, and the binding capacity of the candidate aptamers for the target protein is verified with a flow cytometer:
step a) first amplifying fluorescently labeled single-stranded DNA by PCR, using the primers 5'-Biotin-CTG ACC ACG AGC TCC ATT AG-3' and 5'-FAM-AGC GTC GAA TAC CAC TAC AG-3'; the PCR product is double-stranded DNA with FAM at the 5' end and biotin at the 3' end; streptavidin microbeads are added and reacted for 30 min, the strands are separated with 0.1 mol/L NaOH, and the product is purified on a desalting column to obtain FAM-labeled single-stranded DNA for flow analysis;
step b) determining the dissociation constant using single-stranded DNA at concentrations of 0 nmol/L, 5 nmol/L, 10 nmol/L, 20 nmol/L, 50 nmol/L, 100 nmol/L and 200 nmol/L and EpCAM Ni microbeads carrying the target protein: the DNA solution at each concentration is prepared in 200 µL of binding buffer, heated at 95 ℃ for 5 min, and placed on ice and then at room temperature for 10 min each; 155 nmol/L EpCAM microbeads are then added and incubated at 37 ℃ for 40 min; the microbeads are washed 3 times with binding buffer and resuspended in 250 µL of binding buffer; microbeads incubated with the initial random DNA oligonucleotide library, before screening, serve as the control;
step c) measuring the fluorescence intensity of the microbeads with a flow cytometer from BD (FIG. 4), and determining the affinity of the aptamers by fitting the dissociation constant Kd with SigmaPlot software (FIGS. 5 to 15).
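The text does not state which binding model was fitted in SigmaPlot. As a hedged illustration only, the Python sketch below fits the commonly used one-site saturation model, signal = Bmax · c / (Kd + c), to concentration and fluorescence pairs: it scans a grid of candidate Kd values and, for each, computes the least-squares Bmax in closed form. The model choice, the grid search, the absence of background subtraction and the function name are all assumptions.

```python
def fit_one_site(concentrations, signals, kd_grid):
    """Least-squares fit of signal = Bmax * c / (Kd + c).

    For each candidate Kd in kd_grid, the best Bmax has a closed form
    (linear least squares through the origin); the candidate with the
    smallest sum of squared errors is returned as (Kd, Bmax).
    """
    best = None
    for kd in kd_grid:
        frac = [c / (kd + c) for c in concentrations]  # fraction bound
        denom = sum(f * f for f in frac)
        if denom == 0:
            continue
        bmax = sum(y * f for y, f in zip(signals, frac)) / denom
        sse = sum((y - bmax * f) ** 2 for y, f in zip(signals, frac))
        if best is None or sse < best[0]:
            best = (sse, kd, bmax)
    if best is None:
        raise ValueError("no usable Kd candidate")
    return best[1], best[2]
```

With the concentration series of step b) (0 to 200 nmol/L), a grid such as 0.1, 0.2, …, 200.0 nmol/L covers the range in which a Kd can be estimated from these data; a value near the upper end of the grid would indicate that saturation was not reached.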
Applying the method to the second-generation sequencing data of the EpCAM library, the algorithm evaluates and predicts the target binding capacity of all sequences in the library data and finally yields the 11 highest-scoring, newly identified aptamers, which are the DNA fragments shown in SEQ ID NO. 1 to SEQ ID NO. 11 and are named EpCAM S1-S11 respectively. Their binding to the target protein was verified with a flow cytometer: as shown in fig. 4, EpCAM S1 to S11 have significantly higher fluorescence intensity than the library, the negative results predicted by the algorithm (sequences filtered out by the algorithm and low-scoring sequences) and random sequences. As shown in FIGS. 5-15, the 11 aptamers bind the target protein EpCAM with good dissociation constants (Kd: 8-35), which demonstrates the accuracy and efficiency of the algorithm.

Claims (8)

1. A comprehensive nucleic acid identification method based on a multidimensional analysis framework is characterized by comprising the following steps:
step 1: performing nucleic acid library sequencing data filtering based on pattern sequence search;
step 2: performing high-efficiency classification on the sequencing data of the nucleic acid library filtered in the step 1 by using unsupervised learning;
step 3: based on the library sequence family classification result of step 2, evaluating the secondary substructure/pattern sequence content of the aptamers to obtain a Kscore;
specifically: based on the family classification result of step 2, selecting the highest-frequency nucleic acid in each family as a representative sequence, and calculating the Kscore of the representative sequence from the k-mer scores obtained in step 2, this Kscore being the nucleic acid family Kscore; the nucleic acid family Kscore is calculated as follows:
$$\mathrm{Kscore}_{\mathrm{aptamer}(i)} = \sum \mathrm{score}_{k\text{-mer}(j)}, \qquad k\text{-mer}(j) \in S_{\mathrm{aptamer}(i)}$$
wherein $\mathrm{Kscore}_{\mathrm{aptamer}}$ is the overall enrichment of secondary substructure/pattern sequences in the representative sequence of each nucleic acid family, and the $\mathrm{Kscore}_{\mathrm{aptamer}}$ of the representative sequence is then taken as the nucleic acid family Kscore;
step 4: based on the library sequence family classification result of step 2, evaluating the enrichment degree of each aptamer family according to family size to obtain an Fscore;
step 5: based on prediction of the minimum free energy of the secondary structure and of the G tetramer structure, evaluating the secondary structure stability of the aptamers in the library sequence family classification result of step 2 to obtain an Sscore;
step 6: using the multidimensional analysis framework, comprehensively evaluating and balancing the functional nucleic acids in three respects, namely secondary substructure/pattern sequence content (Kscore), aptamer family enrichment degree (Fscore) and secondary structure stability (Sscore), and identifying high-performance functional nucleic acids.
2. The method for comprehensively identifying nucleic acids based on a multidimensional analysis framework as claimed in claim 1, wherein step 2 specifically comprises: applying a pairwise comparison strategy to all aptamer sequences remaining after the filtering of step 1, scoring the alignment of every pair of aptamer sequences using BLASTshort, constructing a nucleic acid correlation graph based on the alignment scores, in which the edge weight Weight_edge(ab) is the normalized alignment score, and performing family classification using a Markov clustering algorithm; wherein the weight is calculated as follows:
Figure FDA0003963870980000021
wherein bitscore is the alignment score output by the BLAST alignment algorithm, and a, b represent any two vertices.
3. The method for comprehensively identifying nucleic acids based on the multidimensional analysis framework as claimed in claim 1, wherein in the step 4, the calculation formula of the enrichment degree Fscore of the nucleic acid family is as follows:
Figure FDA0003963870980000022
wherein, family (i) represents the ith family, i =1,2,3 … n; fsize is the family size and mean is the average family size.
4. The method for comprehensively identifying nucleic acids based on the multidimensional analysis framework as claimed in claim 1, wherein step 5 specifically comprises: inferring the minimum free energy of the secondary structure of the aptamer using mfold or RNAfold, estimating the likelihood that the aptamer forms a G tetramer using QGRS to obtain a GS score, and weighing the minimum free energy against the GS score to obtain the Sscore.
5. The comprehensive nucleic acid identification method based on the multidimensional analysis framework as claimed in claim 4, wherein the calculation formula of Sscore is as follows:
Figure FDA0003963870980000031
wherein family(i) represents the ith family, i = 1, 2, 3 … n; r_aptamer represents the highest-frequency sequence of each nucleic acid aptamer family.
6. The method for comprehensively identifying nucleic acids based on the multidimensional analysis framework as claimed in claim 1, wherein step 6 specifically comprises: selecting the two larger of the three scores Kscore, Fscore and Sscore, summing and averaging them to obtain a final MDA-score for evaluating the performance of the functional nucleic acid, and finally selecting nucleic acid aptamer candidate sequences based on the MDA-score, wherein a higher MDA-score is considered to indicate a higher likelihood of target binding.
7. The method for comprehensive identification of nucleic acids based on multidimensional analysis framework according to claim 6, wherein the MDA-score is calculated according to the following formula:
Figure FDA0003963870980000032
wherein min_score represents the minimum of the three scores.
8. Use of the method for comprehensive identification of nucleic acids based on a multidimensional analysis framework according to any one of claims 1 to 7 for screening of aptamers.
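The scoring described in claims 1, 6 and 7 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the k-mer length, the example score table and the treatment of S_aptamer(i) as the set of distinct k-mers in the representative sequence are all hypothetical. The MDA-score expression is inferred from the wording of claims 6 and 7 (average of the two larger scores, with min_score the smallest of the three), since the formula image itself is not reproduced in the text, and the three scores are assumed to be on a comparable scale already.

```python
# Sketch only: all names and example values below are illustrative assumptions.

def representative(family_counts):
    """Return the highest-frequency sequence of a family (claim 1, step 3)."""
    return max(family_counts, key=family_counts.get)

def kscore(sequence, kmer_scores, k):
    """Sum the scores of the distinct k-mers contained in the sequence."""
    kmers = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
    # k-mers without an entry in the score table contribute nothing.
    return sum(kmer_scores.get(kmer, 0.0) for kmer in kmers)

def mda_score(k_score, f_score, s_score):
    """Average of the two larger scores: (K + F + S - min_score) / 2."""
    min_score = min(k_score, f_score, s_score)
    return (k_score + f_score + s_score - min_score) / 2

# Hypothetical family: sequence -> read count.
family = {"ACGTAC": 120, "ACGTAA": 15}
# Hypothetical k-mer score table for k = 3.
scores = {"ACG": 1.0, "GTA": 0.5, "TTT": 9.0}

rep = representative(family)
family_kscore = kscore(rep, scores, k=3)
final = mda_score(0.75, 0.25, 0.5)
print(rep, family_kscore, final)
```

Dropping the smallest of the three scores means a candidate is not penalized for being weak in one dimension as long as it is strong in the other two.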
CN201910850896.5A 2019-09-10 2019-09-10 Comprehensive functional nucleic acid identification method based on multi-dimensional analysis framework and application thereof Active CN110600080B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910850896.5A CN110600080B (en) 2019-09-10 2019-09-10 Comprehensive functional nucleic acid identification method based on multi-dimensional analysis framework and application thereof

Publications (2)

Publication Number Publication Date
CN110600080A CN110600080A (en) 2019-12-20
CN110600080B true CN110600080B (en) 2023-04-18

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006236153A (en) * 2005-02-25 2006-09-07 Dainippon Sumitomo Pharma Co Ltd Functional nucleic acid array analysis method
CN110600080A (en) * 2019-09-10 2019-12-20 上海交通大学医学院附属仁济医院 Comprehensive functional nucleic acid identification method based on multi-dimensional analysis framework and application thereof
CN110592093A (en) * 2019-09-10 2019-12-20 上海交通大学医学院附属仁济医院 Aptamer capable of identifying EpCAM protein and preparation method and application thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SMART-Aptamer-Mannul;songjiajia2018;《https://github.com/songjiajia2018/SMART-Aptamer-v1/blob/2e437a956fec48cf8f66a34fcda09f4336be75c6/SMART-Aptamer-Manual.pdf》;20190806;第1-11页 *

Also Published As

Publication number Publication date
CN110600080A (en) 2019-12-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant