CN113436683A - Method and system for screening candidate inserts - Google Patents

Method and system for screening candidate inserts

Info

Publication number
CN113436683A
Authority
CN
China
Prior art keywords
coding region
sequence
candidate
determining
sequences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010207885.8A
Other languages
Chinese (zh)
Inventor
黄慧雅
曹玉冰
刘乙齐
郭亚琨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Syngentech Co ltd
Original Assignee
Beijing Syngentech Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Syngentech Co ltd
Priority to CN202010207885.8A
Publication of CN113436683A

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • C: CHEMISTRY; METALLURGY
    • C07: ORGANIC CHEMISTRY
    • C07K: PEPTIDES
    • C07K16/00: Immunoglobulins [IGs], e.g. monoclonal or polyclonal antibodies
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N: MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00: Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09: Recombinant DNA-technology
    • C12N15/11: DNA or RNA fragments; Modified forms thereof; Non-coding nucleic acids having a biological activity
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • C: CHEMISTRY; METALLURGY
    • C07: ORGANIC CHEMISTRY
    • C07K: PEPTIDES
    • C07K2317/00: Immunoglobulins specific features
    • C07K2317/60: Immunoglobulins specific features characterized by non-natural combinations of immunoglobulin fragments
    • C07K2317/62: Immunoglobulins specific features characterized by non-natural combinations of immunoglobulin fragments comprising only variable region components
    • C07K2317/622: Single chain antibody (scFv)

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Genetics & Genomics (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Organic Chemistry (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Medical Informatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Biochemistry (AREA)
  • Wood Science & Technology (AREA)
  • Zoology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioethics (AREA)
  • Data Mining & Analysis (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Medicinal Chemistry (AREA)
  • Artificial Intelligence (AREA)
  • Analytical Chemistry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Plant Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a method for screening candidate inserts used to modify the nucleic acid sequence of a sample to be tested. The method comprises: determining an initial gene coding region sequence based on a candidate gene for modifying the nucleic acid sequence of the sample to be tested; performing synonymous codon replacement on the initial gene coding region sequence to obtain a candidate coding region sequence set consisting of a plurality of candidate coding region sequences; aligning each candidate coding region sequence against itself, a host genome and a viral genome, and determining regions of high homology risk, so as to assign a homology risk score to each candidate coding region sequence; and determining a subset of preferred coding region sequences within the candidate coding region sequence set based on the homology risk scores.

Description

Method and system for screening candidate inserts
Technical Field
The present invention relates to the field of bioinformatics, and in particular to a method and system for screening candidate inserts.
Background
Viruses often have stable structures, simple genomes, broad-spectrum infectivity and efficient packaging, and have therefore become widely used engineered vectors for DNA delivery and expression. In addition, researchers have used inactivated, attenuated or engineered viruses as effective vaccines, taking advantage of the immunogenic properties of the virus itself. Furthermore, researchers have engineered viruses into oncolytic viruses, which retain the ability to replicate and package and achieve specific tumor killing by exploiting the virus's ability to lyse host cells during amplification. As virus research has advanced, a variety of virus species have become targets of engineering, and various virus products are now applied in clinical treatment. Compared with other virus types, adenovirus is favored for its relatively stable structure, its packaging capacity of about 9000 bp and its lower toxic side effects, and it can be used both as a non-replicating vector for DNA delivery and expression and as a replicating vector for oncolytic killing.
Despite these advantages as an engineering vector, even a relatively stable adenovirus subtype, such as the commonly used adenovirus type 5, accumulates single nucleotide variants, small insertion-deletion variants and large structural variants as the number of culture passages increases during large-scale pharmaceutical production. In addition, an artificial exogenous sequence introduced during engineering may contain regions that are highly homologous to itself or to the viral genome, which increases the risk of structural variation caused by homologous recombination. This homology risk further reduces the stability of the engineered adenovirus genome. Such variation not only affects the yield and purity of the engineered adenovirus in pharmaceutical production, but may also affect its function and safety. For example, the emergence of replication-competent adenovirus during production of non-replicating engineered adenovirus, and the emergence of adenovirus with uncontrolled replication during production of engineered adenovirus with controlled replication, are important quality-control concerns.
Disclosure of Invention
The present application is based on the discovery and recognition by the inventors of the following facts and problems:
In the pharmaceutical production of an engineered virus, the artificial exogenous sequence and the viral genome may share highly homologous regions, which increases the risk of structural variation caused by homologous recombination. The inventors performed an informatics analysis of the sequence features of the artificial exogenous sequence and the viral genome, predicted and analyzed the regions at risk of homology, and reduced the homology risk of the sequence by combining approaches such as synonymous codon replacement and homologous subtype replacement, thereby increasing the stability of the engineered viral genome.
To this end, in a first aspect of the invention, the invention provides a method of screening for candidate inserts for use in engineering a test sample nucleic acid sequence. According to an embodiment of the invention, the method comprises: (1) determining an initial gene coding region sequence based on a candidate gene for modifying a nucleic acid sequence of a sample to be detected; (2) performing synonymous codon replacement on the initial gene coding region sequence so as to obtain a candidate coding region sequence set consisting of a plurality of candidate coding region sequences; (3) comparing the candidate coding region sequences to a host genome and a viral genome and determining regions of high risk of homology so as to perform a homology risk scoring on the candidate coding region sequences; (4) determining a subset of preferred coding region sequences among the set of candidate coding region sequences based on the homology risk scores.
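Step (2) can be illustrated with a minimal sketch. It assumes the standard genetic code and uniform random choice among synonymous codons; the function names `translate` and `synonymous_variants` are hypothetical, and the source does not specify how replacements are sampled.

```python
import itertools
import random

# Standard genetic code, built in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO)
}

# Group codons by the amino acid (or stop) they encode.
SYNONYMS = {}
for codon, aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def translate(cds):
    """Translate a coding sequence whose length is a multiple of 3."""
    return "".join(CODON_TO_AA[cds[i:i + 3]] for i in range(0, len(cds), 3))


def synonymous_variants(cds, n, seed=0):
    """Return up to n distinct sequences encoding the same protein as cds."""
    rng = random.Random(seed)
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    variants = set()
    for _ in range(n * 20):  # bounded number of attempts
        if len(variants) >= n:
            break
        variants.add(
            "".join(rng.choice(SYNONYMS[CODON_TO_AA[c]]) for c in codons)
        )
    return sorted(variants)


initial_cds = "ATGGCTCTGAGCTAA"
candidates = synonymous_variants(initial_cds, 5)
```

Every sequence in `candidates` translates to the same protein as `initial_cds`, so the set can be passed on to the scoring steps without changing the encoded product.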
According to an embodiment of the present invention, the method may further include at least one of the following additional technical features:
According to an embodiment of the invention, the method further comprises:
performing codon pair bias scoring on the candidate coding region sequence based on codon frequency in the host; or
performing codon bias scoring on the candidate coding region sequence based on codon frequency in the host; or
performing CpG scoring on the candidate coding region sequence based on the C base frequency, the G base frequency and the CpG sequence frequency in the candidate coding region sequence; or
performing TpA scoring on the candidate coding region sequence based on the A base frequency, the T base frequency and the TpA sequence frequency in the candidate coding region sequence; or
predicting the minimum free energy of the RNA secondary structure based on the mRNA sequence encoded by the candidate coding region sequence; or
performing microsatellite instability scoring on the candidate coding region sequence based on the microsatellite sequences in the candidate coding region sequence.
According to an embodiment of the present invention, step (4) further comprises: determining a subset of preferred coding region sequences within the candidate coding region sequence set based on the homology risk score and at least one of the following scores: the codon pair bias score, the codon bias score, the CpG score, the TpA score, the mRNA secondary structure minimum free energy, and the microsatellite instability score.
The priority of these criteria decreases in the following order: homology risk score, codon pair bias score, codon bias score, CpG score, TpA score, mRNA secondary structure minimum free energy, and microsatellite instability score.
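A strictly decreasing priority order of this kind can be sketched as a lexicographic sort. The example below assumes that every criterion has already been converted into a penalty for which lower is better; that conversion, the candidate names and the numbers are all hypothetical and are not given in the source.

```python
# Each candidate maps to a tuple of penalties in descending priority:
# (homology risk, codon pair bias, codon bias, CpG, TpA, MFE, microsatellite).
# Assumption: lower is better for every entry; the values are illustrative only.
penalties = {
    "cand_a": (2, 0.1, 0.3, 0.5, 0.4, 0.2, 0),
    "cand_b": (1, 0.9, 0.1, 0.2, 0.1, 0.1, 0),
    "cand_c": (1, 0.2, 0.8, 0.9, 0.9, 0.9, 3),
}


def rank_candidates(penalties):
    """Sort candidate names by their penalty tuples, highest priority first.

    Python compares tuples element by element, so a lower-priority score
    only matters when all higher-priority scores are tied.
    """
    return sorted(penalties, key=lambda name: penalties[name])


ranking = rank_candidates(penalties)
```

Here `cand_c` ranks ahead of `cand_b` because the two tie on homology risk and `cand_c` has the lower codon pair penalty, even though `cand_b` is better on every remaining criterion.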
According to the method provided by the embodiment of the invention, a candidate insert (in the embodiment, called homologous sequence) of the coding region sequence of the gene can be obtained, and the candidate insert can effectively reduce the risk of structural variation caused by homologous recombination and improve the stability of a nucleic acid sequence (such as a virus genome) of a sample to be tested.
According to the embodiment of the invention, a sliding window sequence alignment method is adopted, and the length of the sliding window is 12-17 bp.
According to an embodiment of the invention, in step (3), the homology risk score is determined based on:
Through sliding-window sequence alignment, the length and frequency of the longest consistent sequence are counted for each number of mismatches from 0 to 4 bp and for each reference sequence, namely the candidate sequence itself, the viral genome and the host genome. The homology risks of different candidate coding region sequences are then compared using three criteria in descending priority: the number of mismatches, the reference sequence, and the longest consistent sequence. Within the number of mismatches, priority decreases from 0 to 4 bp; within the reference sequence, priority decreases from the candidate itself to the viral genome to the host genome; within the longest consistent sequence, length takes priority over frequency. The higher the priority and the larger the value, the higher the homology risk.
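The counting step can be sketched as follows. This is a simplified illustration under stated assumptions: alignments are ungapped, "frequency" is taken to mean the number of alignment end positions that reach the longest length, and the function name `longest_match` and its parameters are hypothetical. It scans every diagonal with a two-pointer window, which is adequate for short inserts but not for whole host genomes.

```python
def longest_match(query, ref, max_mismatch=0, min_len=12, is_self=False):
    """Longest ungapped match between query and ref with a mismatch budget.

    Returns (length, frequency); (0, 0) if nothing reaches min_len, the
    sliding-window length (12-17 bp in the text above).
    """
    best_len, freq = 0, 0
    # For self-comparison skip the trivial full-length diagonal (offset 0)
    # and the mirrored negative offsets.
    first_offset = 1 if is_self else -(len(query) - 1)
    for offset in range(first_offset, len(ref)):
        start = max(0, -offset)
        end = min(len(query), len(ref) - offset)
        left, mismatches = start, 0
        for right in range(start, end):
            if query[right] != ref[right + offset]:
                mismatches += 1
            # Shrink the window until it fits the mismatch budget again.
            while mismatches > max_mismatch:
                if query[left] != ref[left + offset]:
                    mismatches -= 1
                left += 1
            length = right - left + 1
            if length > best_len:
                best_len, freq = length, 1
            elif length == best_len and length > 0:
                freq += 1
    return (best_len, freq) if best_len >= min_len else (0, 0)


candidate = "CCACGTAGCTAGCATGCC"
viral_fragment = "TTACGTAGCTAGCATGTT"  # toy stand-in for a viral genome
exact = longest_match(candidate, viral_fragment, max_mismatch=0)
relaxed = longest_match(candidate, viral_fragment, max_mismatch=2)
against_self = longest_match(candidate, candidate, is_self=True)
```

Calling the function for each mismatch budget from 0 to 4 and for each reference yields the tuples that the priority comparison described above would operate on.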
according to an embodiment of the invention, the codon pair preference score is determined based on the following formula:
Figure BDA0002421780560000021
according to an embodiment of the invention, the codon preference score is determined based on the following formula:
Figure BDA0002421780560000031
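The formula image is not reproduced in this text. As a hedged illustration only, the sketch below computes a codon adaptation index style score, a widely used form of codon bias score based on host codon frequencies; it is an assumption that the score here resembles it. The usage table values and the name `codon_adaptation_index` are hypothetical.

```python
import math

# Toy host codon usage: amino acid -> {codon: relative frequency}.
# Values are illustrative, not real host data.
TOY_USAGE = {
    "A": {"GCC": 0.4, "GCT": 0.2},
    "K": {"AAG": 0.6, "AAA": 0.4},
}


def codon_adaptation_index(cds, usage):
    """Geometric mean of each codon's frequency relative to the most
    frequent synonymous codon. Codons absent from the table are skipped."""
    # Relative adaptiveness w = f(codon) / max f(synonymous codons).
    weights = {}
    for codon_freqs in usage.values():
        top = max(codon_freqs.values())
        for codon, freq in codon_freqs.items():
            weights[codon] = freq / top
    logs = [
        math.log(weights[cds[i:i + 3]])
        for i in range(0, len(cds) - 2, 3)
        if cds[i:i + 3] in weights
    ]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0


preferred = codon_adaptation_index("GCCAAG", TOY_USAGE)
rare = codon_adaptation_index("GCTAAA", TOY_USAGE)
```

A sequence built only from the most frequent codons scores 1.0, and the score falls as rarer synonymous codons are substituted.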
according to an embodiment of the present invention, the CpG score is determined based on the following formula:
Figure BDA0002421780560000032
according to an embodiment of the invention, the TpA score is determined based on the following formula:
Figure BDA0002421780560000033
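The CpG and TpA formula images are not reproduced in this text. Since both scores are described as depending on the two single-base frequencies and the dinucleotide frequency, one plausible form is the observed/expected dinucleotide ratio; the sketch below assumes that form, and the function name `dinucleotide_obs_exp` is hypothetical.

```python
def dinucleotide_obs_exp(seq, dinuc):
    """Observed/expected ratio f(XY) / (f(X) * f(Y)) for dinucleotide XY.

    Assumption: f(X) = count(X) / N and f(XY) = count(XY) / (N - 1).
    Returns 0.0 when the ratio is undefined.
    """
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        return 0.0
    first, second = dinuc
    expected = (seq.count(first) / n) * (seq.count(second) / n)
    observed = sum(1 for i in range(n - 1) if seq[i:i + 2] == dinuc) / (n - 1)
    return observed / expected if expected > 0 else 0.0


cpg_score = dinucleotide_obs_exp("CGCG", "CG")
tpa_score = dinucleotide_obs_exp("TATA", "TA")
```

A ratio above 1 means the dinucleotide occurs more often than its base composition alone would predict, and a ratio below 1 means it is depleted.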
According to an embodiment of the invention, the minimum free energy of the RNA secondary structure is predicted using the software Mfold or ViennaRNA.
According to an embodiment of the invention, the microsatellite instability score is determined based on:
For microsatellite regions in which a microsatellite sequence shorter than the sliding window of the homology risk score is repeated consecutively at least 3 times, the length and frequency of the longest microsatellite region are counted. Candidates are compared first by the length and then by the frequency of the longest microsatellite region; the higher the priority and the larger the value, the higher the microsatellite instability.
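The counting rule can be sketched as below. It assumes that "microsatellite sequence" refers to the repeat unit, so unit lengths from 1 up to one less than the sliding-window length are scanned, and that "frequency" means the number of distinct regions reaching the longest length. The function name `microsatellite_score` and its parameters are hypothetical.

```python
def microsatellite_score(seq, window=12, min_repeats=3):
    """Return (length, frequency) of the longest microsatellite region.

    A region is a repeat unit shorter than `window` that occurs at least
    `min_repeats` times in a row. Returns (0, 0) if there is none.
    """
    regions = set()  # (start, end) pairs, deduplicated across unit sizes
    n = len(seq)
    for unit in range(1, window):
        i = 0
        while i + unit * min_repeats <= n:
            motif = seq[i:i + unit]
            j = i + unit
            # Extend while the next unit-sized chunk repeats the motif.
            while seq[j:j + unit] == motif:
                j += unit
            if (j - i) // unit >= min_repeats:
                regions.add((i, j))
                i = j  # continue after the region
            else:
                i += 1
    if not regions:
        return (0, 0)
    longest = max(end - start for start, end in regions)
    freq = sum(1 for start, end in regions if end - start == longest)
    return (longest, freq)


score = microsatellite_score("GGACACACACTTTGCA")
```

In the example, the `AC` repeat (4 copies, 8 bp) is the longest region and the `TTT` run is also detected but is shorter, so the score is driven by the `AC` repeat alone.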
According to an embodiment of the present invention, in step (1) an initial non-coding region sequence is further determined, and the method further comprises: (5) obtaining a first homologous subtype sequence of the initial non-coding region sequence; (6) determining the respective conserved regions and non-conserved regions based on the initial non-coding region sequence and the first homologous subtype sequence; (7) obtaining a set of candidate non-coding region sequences consisting of a plurality of candidate non-coding region sequences by at least one of random mutagenesis and truncation of the non-conserved regions; (8) performing the homology risk scoring and optionally the microsatellite instability scoring on the candidate non-coding region sequences to determine a preferred subset of non-coding region sequences within the set of candidate non-coding region sequences.
In a second aspect of the invention, a method of screening for candidate inserts for engineering a test sample nucleic acid sequence is provided. According to an embodiment of the invention, the method comprises:
(a) determining an initial non-coding region sequence based on a candidate gene for modifying a nucleic acid sequence of a sample to be detected;
(b) obtaining a first homologous subtype sequence of the initial non-coding region sequence;
(c) determining respective conserved regions and non-conserved regions based on the initial non-coding region sequence and the first homologous subtype sequence;
(d) obtaining a set of candidate non-coding region sequences consisting of a plurality of candidate non-coding region sequences by at least one of random mutagenesis and truncation of the non-conserved region;
(e) performing the homology risk scoring and optionally the microsatellite instability scoring on the candidate non-coding region sequences;
(f) determining a preferred subset of non-coding region sequences among the set of candidate non-coding region sequences based on the homology risk score and optionally the microsatellite instability score.
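Steps (c) and (d) can be sketched as follows, assuming the initial sequence and its homologous subtype have already been aligned to equal length without gaps. The per-position mutation rate, the toy sequences and the function names `conserved_mask` and `mutate_nonconserved` are hypothetical; truncation is not shown.

```python
import random


def conserved_mask(aligned_seqs):
    """True at positions where all aligned sequences carry the same base."""
    return [len(set(column)) == 1 for column in zip(*aligned_seqs)]


def mutate_nonconserved(seq, mask, n_variants, rate=0.5, seed=0):
    """Generate variants by randomly substituting non-conserved positions.

    Conserved positions (mask True) are never changed.
    """
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        bases = list(seq)
        for i, conserved in enumerate(mask):
            if not conserved and rng.random() < rate:
                bases[i] = rng.choice([b for b in "ACGT" if b != bases[i]])
        variants.append("".join(bases))
    return variants


initial = "ACGTACGT"
subtype = "ACGAACGC"  # toy homologous subtype, differs at positions 3 and 7
mask = conserved_mask([initial, subtype])
variants = mutate_nonconserved(initial, mask, n_variants=10)
```

The resulting variants would then go through the homology risk scoring of step (e) before the preferred subset is chosen in step (f).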
According to the method provided by the embodiment of the invention, a candidate insert (in the embodiment, called homologous sequence) of the gene non-coding region sequence can be obtained, and the candidate insert can effectively reduce the risk of structural variation caused by homologous recombination and improve the stability of a nucleic acid sequence (such as a virus genome) of a sample to be tested.
According to an embodiment of the present invention, the nucleic acid sequence of the sample to be tested comprises a viral genome sequence.
In a third aspect of the invention, the invention proposes a computer-readable storage medium having a computer program stored thereon. According to an embodiment of the invention, the program, when executed by a processor, implements the method for screening candidate inserts as described above.
In a fourth aspect of the invention, an electronic device is presented. According to an embodiment of the present invention, the electronic device includes a memory, a processor; wherein the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to implement the method for screening candidate inserts described above.
In a fifth aspect, the present invention provides a system for screening candidate inserts for engineering a test sample nucleic acid sequence. According to an embodiment of the invention, the system comprises:
an initial gene coding region sequence determining device, which determines an initial gene coding region sequence based on a candidate gene for modifying the nucleic acid sequence of the sample to be tested;
a candidate coding region sequence set determining device, which is connected to the initial gene coding region sequence determining device and is used for performing synonymous codon replacement on the initial gene coding region sequence to obtain a candidate coding region sequence set consisting of a plurality of candidate coding region sequences;
a homology risk scoring device, which is connected to the candidate coding region sequence set determining device and is used for aligning the candidate coding region sequences against a host genome and a viral genome and determining regions of high homology risk, so as to perform homology risk scoring on the candidate coding region sequences;
and optionally a codon pair bias scoring device, which is connected to the candidate coding region sequence set determining device and is used for performing codon pair bias scoring on the candidate coding region sequences based on codon frequency in the host; or
a codon bias scoring device, which is connected to the candidate coding region sequence set determining device and is used for performing codon bias scoring on the candidate coding region sequences based on codon frequency in the host; or
a CpG scoring device, which is connected to the candidate coding region sequence set determining device and is used for performing CpG scoring on the candidate coding region sequences based on the C base frequency, the G base frequency and the CpG sequence frequency in the candidate coding region sequences; or
a TpA scoring device, which is connected to the candidate coding region sequence set determining device and is used for performing TpA scoring on the candidate coding region sequences based on the A base frequency, the T base frequency and the TpA sequence frequency in the candidate coding region sequences; or
an mRNA secondary structure minimum free energy prediction device, which is connected to the candidate coding region sequence set determining device and is used for predicting the mRNA secondary structure minimum free energy of the candidate coding region sequences based on the mRNA sequences they encode; or
a microsatellite instability scoring device, which is connected to the candidate coding region sequence set determining device and is used for performing microsatellite instability scoring on the candidate coding region sequences based on the microsatellite sequences in the candidate coding region sequences; and
a preferred coding region sequence subset determining device, which is connected to the homology risk scoring device and optionally to the codon pair bias scoring device, the codon bias scoring device, the CpG scoring device, the TpA scoring device, the mRNA secondary structure minimum free energy prediction device and the microsatellite instability scoring device, and is used for determining a subset of preferred coding region sequences within the candidate coding region sequence set based on the homology risk score and optionally the codon pair bias score, the codon bias score, the CpG score, the TpA score, the mRNA secondary structure minimum free energy and the microsatellite instability score.
The system according to embodiments of the present invention is suitable for performing the method for screening candidate inserts described above, so as to reduce the sequence homology risk and increase the stability of the engineered viral genome.
In a sixth aspect, the present invention provides a system for screening candidate inserts for engineering a test sample nucleic acid sequence, the system comprising:
an initial non-coding region sequence determining device;
a first homologous subtype sequence acquiring device, which is connected to the initial non-coding region sequence determining device and is used for acquiring a first homologous subtype sequence of the initial non-coding region sequence;
a conserved region and non-conserved region determining device, which is connected to the initial non-coding region sequence determining device and the first homologous subtype sequence acquiring device and is used for determining the respective conserved regions and non-conserved regions based on the initial non-coding region sequence and the first homologous subtype sequence;
a candidate non-coding region sequence set determining device, which is connected to the conserved region and non-conserved region determining device and obtains a candidate non-coding region sequence set consisting of a plurality of candidate non-coding region sequences by performing at least one of random mutation and truncation on the non-conserved regions; and
a preferred non-coding region sequence subset determining device, which is connected to the candidate non-coding region sequence set determining device and is used for performing the homology risk scoring and optionally the microsatellite instability scoring on the candidate non-coding region sequences, so as to determine a subset of preferred non-coding region sequences within the candidate non-coding region sequence set.
In a seventh aspect of the invention, the invention provides a UAS sequence. According to the embodiment of the invention, the sequence has a nucleotide sequence shown as SEQ ID NO 1, 4-6, 14, 17, 23, 25, 31 and 34. The inventor screens the homologous sequence of the UAS by the method, and finds that the nucleotide sequence shown by SEQ ID NO 1, 4-6, 14, 17, 23, 25, 31 and 34 has stronger expression strength and can be used for constructing the combined UAS.
In an eighth aspect of the invention, the invention provides a set of UAS combinations. According to an embodiment of the invention, the combination comprises a series of UAS sequences having the nucleotide sequences shown in SEQ ID NO: 4-6 and 17; or in SEQ ID NO: 1, 23, 31 and 34; or in SEQ ID NO: 1, 4, 5, 6 and 14; or in SEQ ID NO: 17, 23, 25, 31 and 34. The inventors screened UAS homologous sequences by the aforementioned method and found that promoters downstream of the UAS sequences shown in SEQ ID NO: 1, 4-6, 14, 17, 23, 25, 31 and 34 have relatively strong expression. When these UAS sequence combinations are used to construct a mutually inhibiting switch-type gene circuit, the fold difference in expression between switch states for all four combined-promoter pairs remains above 10-fold, as with the switch-type gene circuit built from the original promoters, while the 15-generation variation rate of the virus attributable to the promoters falls from 0.1 to an undetected level. The UAS combinations according to embodiments of the invention can therefore effectively reduce the homology risk while maintaining the expression strength of the downstream promoter.
According to an embodiment of the present invention, the UAS combination has a UAS connection order from 5 'end to 3' end as follows:
B1-A5-A6-A4(SEQ ID NO:39), D5-A1-D2-C1(SEQ ID NO:41), A5-A1-A6-A4-A14(SEQ ID NO:42) or B1-D5-D2-C1-C3(SEQ ID NO: 44).
In a ninth aspect of the invention, the invention provides a Ni-scFv gene sequence. According to an embodiment of the invention, the sequence has the nucleotide sequence shown in SEQ ID NO: 46 or 47. The inventors screened homologous sequences of the Ni-scFv gene by the aforementioned method and found that, for the Ni-scFv gene having the nucleotide sequence shown in SEQ ID NO: 46 or 47, the length of the longest mismatch-free homologous region is reduced, the 15-generation variation rate falls from 0.01 to an undetected level, and the effective content shows no significant difference from that before optimization. The Ni-scFv gene having the nucleotide sequence shown in SEQ ID NO: 46 or 47 therefore effectively reduces the homology risk while maintaining the function of the target gene.
In a tenth aspect of the invention, the invention provides an At-scFv gene sequence. According to an embodiment of the invention, the sequence has the nucleotide sequence shown in SEQ ID NO: 49 or 50. The inventors screened homologous sequences of the At-scFv gene by the aforementioned method and found that, for the At-scFv gene having the nucleotide sequence shown in SEQ ID NO: 49 or 50, the length of the longest mismatch-free homologous region is reduced, the 15-generation variation rate falls from 0.01 to an undetected level, and the effective content shows no significant difference from that before optimization. The At-scFv gene having the nucleotide sequence shown in SEQ ID NO: 49 or 50 therefore effectively reduces the homology risk while maintaining the function of the target gene.
Drawings
FIG. 1 is a flow chart of optimization of viral homology sequence analysis;
FIG. 2 shows the distribution of the binding site pattern of Gal4 protein (from doi:10.1093/nar/gkl857);
FIG. 3 is a 4x UAS and 5x UAS homologous subtype promoter expression intensity test;
FIG. 4 is a UAS homologous subtype combined promoter expression intensity test;
FIG. 5 is a test of expression difference between switch states of UAS homologous subtype combined promoter pairs for constructing mutually inhibited switch type gene circuits;
FIG. 6 is a test of engineering virus function and genome stability before and after UAS optimization;
FIG. 7 shows the engineering virus function and genome stability test before and after optimization of the target gene sequence;
FIG. 8 is a schematic diagram of a system for screening candidate inserts, according to an embodiment of the present invention; and
FIG. 9 is a schematic structural diagram of a system for screening candidate inserts according to another embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings. The embodiments described below with reference to the drawings are illustrative, are intended to explain the invention, and are not to be construed as limiting it.
The invention provides a system for screening candidate inserts, which are used for modifying a nucleic acid sequence of a sample to be tested. According to an embodiment of the invention, with reference to fig. 8, the system comprises:
an initial gene coding region sequence determining device 100, which is used for determining an initial gene coding region sequence based on a candidate gene for modifying the nucleic acid sequence of the sample to be tested;
a candidate coding region sequence set determining device 200, which is connected to the initial gene coding region sequence determining device 100 and is used for performing synonymous codon replacement on the initial gene coding region sequence to obtain a candidate coding region sequence set consisting of a plurality of candidate coding region sequences;
a homology risk scoring device 300, which is connected to the candidate coding region sequence set determining device 200 and is used for aligning the candidate coding region sequences against the host genome and the viral genome and determining regions of high homology risk, so as to perform homology risk scoring on the candidate coding region sequences;
and optionally a codon pair bias scoring device, which is connected to the candidate coding region sequence set determining device and is used for performing codon pair bias scoring on the candidate coding region sequences based on codon frequency in the host; or
a codon bias scoring device, which is connected to the candidate coding region sequence set determining device and is used for performing codon bias scoring on the candidate coding region sequences based on codon frequency in the host; or
a CpG scoring device, which is connected to the candidate coding region sequence set determining device and is used for performing CpG scoring on the candidate coding region sequences based on the C base frequency, the G base frequency and the CpG sequence frequency in the candidate coding region sequences; or
a TpA scoring device, which is connected to the candidate coding region sequence set determining device and is used for performing TpA scoring on the candidate coding region sequences based on the A base frequency, the T base frequency and the TpA sequence frequency in the candidate coding region sequences; or
an mRNA secondary structure minimum free energy prediction device, which is connected to the candidate coding region sequence set determining device and is used for predicting the mRNA secondary structure minimum free energy of the candidate coding region sequences based on the mRNA sequences they encode; or
a microsatellite instability scoring device, which is connected to the candidate coding region sequence set determining device and is used for performing microsatellite instability scoring on the candidate coding region sequences based on the microsatellite sequences in the candidate coding region sequences; and
a preferred coding region sequence subset determining device 400, which is connected to the homology risk scoring device 300 and optionally to the codon pair bias scoring device, the codon bias scoring device, the CpG scoring device, the TpA scoring device, the mRNA secondary structure minimum free energy prediction device and the microsatellite instability scoring device, and is used for determining a subset of preferred coding region sequences within the candidate coding region sequence set based on the homology risk score and optionally the codon pair bias score, the codon bias score, the CpG score, the TpA score, the mRNA secondary structure minimum free energy and the microsatellite instability score.
The system according to embodiments of the present invention is suitable for performing the method for screening candidate inserts described above, so as to reduce the sequence homology risk and increase the stability of the engineered viral genome.
In a sixth aspect of the invention, the invention provides a system for screening candidate inserts, the candidate inserts being used to engineer a test sample nucleic acid sequence. With reference to fig. 9, the system comprises: an initial non-coding region sequence determining device 500; a first homologous subtype sequence acquiring device 600, which is connected to the initial non-coding region sequence determining device 500 and acquires a first homologous subtype sequence of the initial non-coding region sequence; a conserved region and non-conserved region determining device 700, which is connected to the initial non-coding region sequence determining device 500 and the first homologous subtype sequence acquiring device 600 and determines the respective conserved regions and non-conserved regions based on the initial non-coding region sequence and the first homologous subtype sequence; a candidate non-coding region sequence set determining device 800, which is connected to the conserved region and non-conserved region determining device 700 and obtains a candidate non-coding region sequence set consisting of a plurality of candidate non-coding region sequences by performing at least one of random mutation and truncation on the non-conserved regions; and a preferred non-coding region sequence subset determining device 900, which is connected to the candidate non-coding region sequence set determining device 800, performs the homology risk scoring and optionally the microsatellite instability scoring on the non-coding region sequences, and determines a preferred non-coding region sequence subset in the candidate non-coding region sequence set.
The present application is based on the discovery and recognition by the inventors of the following facts and problems:
In the pharmaceutical production of an engineered virus, the artificial exogenous sequence and the viral genome may share highly homologous regions, which increases the risk of structural variation caused by homologous recombination. The inventors performed an informatics analysis of the sequence characteristics of the artificial exogenous sequence and the viral genome, predicted and analyzed the regions with homology risk, and reduced the homology risk of the sequence by combining approaches such as synonymous codon replacement and homologous subtype replacement, thereby increasing the stability of the engineered viral genome.
Example 1
The specific process is as follows:
A reference genome sequence of the engineered virus (such as adenovirus) and the artificial exogenous sequence to be optimized were obtained by means such as literature research and first-generation sequencing. The human genome codon frequencies and codon pair frequencies were obtained by means such as data curation.
1) The artificial exogenous sequence was aligned against itself, the human genome and the engineered adenovirus genome sequence by sliding-window alignment using short-sequence alignment software (such as Bowtie), with a sliding window length of 12-17 bp and 0-4 bp mismatches; the length and frequency of the longest identical sequence were output and scored. Homology risk was compared across candidate coding region sequences using three criteria in descending order of priority: the number of mismatches, the reference sequence, and the longest identical sequence. Within the mismatch criterion, priority decreases from 0 to 4 bp; within the reference criterion, priority decreases in the order self sequence, viral genome, host genome; within the longest identical sequence criterion, length takes priority over frequency. The higher the priority and the larger the value, the higher the homology risk. Based on experimental experience with adenovirus, a region whose longest identical sequence is longer than 15 bp with 0 bp mismatch is a high homology risk region.
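The zero-mismatch case of the sliding-window comparison in step 1) can be sketched with exact k-mer matching. The Python below is a minimal illustration and not the Bowtie-based workflow itself: the function names `window_hits`, `longest_identical` and `is_high_risk` and the toy sequence are assumptions made for this example, mismatches of 1-4 bp and reverse-complement matches are not handled, and only the 12-17 bp window range and the "longer than 15 bp with 0 mismatches" rule are taken from the text.

```python
from collections import defaultdict


def window_hits(query, reference, k, self_compare=False):
    """List (query_pos, reference_pos) pairs where a k-bp window matches exactly."""
    index = defaultdict(list)
    for j in range(len(reference) - k + 1):
        index[reference[j:j + k]].append(j)
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            if self_compare and j <= i:
                continue  # drop the trivial self match and mirrored duplicates
            hits.append((i, j))
    return hits


def longest_identical(query, reference, min_len=12, max_len=17, self_compare=False):
    """Return (length, frequency) of the longest exact match within the window range."""
    best = (0, 0)
    for k in range(min_len, max_len + 1):
        hits = window_hits(query, reference, k, self_compare)
        if not hits:
            break  # no k-bp match means there is no longer match either
        best = (k, len(hits))
    return best


def is_high_risk(length, threshold=15):
    """Empirical adenovirus rule from the text: longer than 15 bp with 0 mismatches."""
    return length > threshold


# Hypothetical toy input: one 16-bp motif placed twice in an unrelated context.
motif = "ACGTTGCAAGCTTGCA"
toy = "TTTT" + motif + "GGGGGGGG" + motif + "CCCC"
length, freq = longest_identical(toy, toy, self_compare=True)
print(length, freq, is_high_risk(length))
```

Because the result is a `(length, frequency)` tuple, two candidates can be compared with ordinary tuple comparison, which mirrors the "length first, then frequency" priority described above.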
2) Sequences located in gene coding regions were subjected to random synonymous codon replacement following the codon frequency distribution of the human genome. Random sequences containing restriction enzyme site sequences, homologous cloning sequences or barcode sequences that would interfere with sequence cloning and vector construction were filtered out, yielding 5000 alternative optimized sequences that encode the same amino acid sequence.
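A minimal sketch of the random synonymous replacement and filtering in step 2) is given below. It builds the standard genetic code table, draws each replacement codon with `random.choices`, and discards candidates that contain a forbidden site. Several parts are assumptions for illustration only: the weights are uniform placeholders (real use would load human codon frequencies), `GAATTC` stands in for whatever restriction, cloning or barcode sequences must be avoided, and the helper names are hypothetical. The input is the first 27 nt of SEQ ID NO: 46.

```python
import random
from collections import defaultdict

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
SYNONYMS = defaultdict(list)
for codon, aa in CODON_TO_AA.items():
    SYNONYMS[aa].append(codon)


def translate(cds):
    """Translate an in-frame coding sequence into one-letter amino acids."""
    return "".join(CODON_TO_AA[cds[i:i + 3]] for i in range(0, len(cds), 3))


def recode(cds, weights, rng):
    """Replace every codon with a synonymous codon drawn according to weights."""
    out = []
    for i in range(0, len(cds), 3):
        options = SYNONYMS[CODON_TO_AA[cds[i:i + 3]]]
        out.append(rng.choices(options, weights=[weights[c] for c in options])[0])
    return "".join(out)


def candidate_set(cds, weights, forbidden, n, seed=0, max_tries=100000):
    """Collect up to n distinct recoded sequences that avoid all forbidden sites."""
    rng = random.Random(seed)
    kept = set()
    for _ in range(max_tries):
        if len(kept) == n:
            break
        cand = recode(cds, weights, rng)
        if not any(site in cand for site in forbidden):
            kept.add(cand)
    return sorted(kept)


# Placeholder weights (assumption): uniform instead of human codon frequencies.
UNIFORM = {codon: 1.0 for codon in CODON_TO_AA}
start = "ATGTCCGTTCCAACCCAGGTTCTCGGC"  # first 27 nt of SEQ ID NO: 46
variants = candidate_set(start, UNIFORM, forbidden=["GAATTC"], n=20)
print(len(variants), translate(start))
```

Every retained variant translates to the same peptide as the input, which is the "invariable coded amino acids" condition of this step.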
3) The alternative sequences were each subjected to the homology risk scoring of step 1). Codon pair frequencies were counted for each alternative sequence, and codon pair bias and codon bias were scored against the codon pair bias of the human genome, using the following formula:
Figure BDA0002421780560000091
CpG and TpA sequence feature frequencies were counted for each alternative sequence and scored using the following formula:
Figure BDA0002421780560000092
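The scoring formula itself is available here only as an image reference. As a rough illustration of how the three named quantities (the two single-base frequencies and the dinucleotide frequency) can be combined, the sketch below computes the commonly used observed/expected dinucleotide ratio. This is an assumed stand-in that may differ from the formula in the figure, and `dinucleotide_ratio` is a hypothetical helper name.

```python
def dinucleotide_ratio(seq, first, second):
    """Observed/expected ratio of the dinucleotide first+second in seq.

    observed = dinucleotide count / number of dinucleotide positions
    expected = frequency(first) * frequency(second)
    """
    n = len(seq)
    if n < 2:
        return 0.0
    f_first = seq.count(first) / n
    f_second = seq.count(second) / n
    pair = first + second
    observed = sum(1 for i in range(n - 1) if seq[i:i + 2] == pair) / (n - 1)
    expected = f_first * f_second
    return observed / expected if expected else 0.0


def cpg_score(seq):
    return dinucleotide_ratio(seq, "C", "G")


def tpa_score(seq):
    return dinucleotide_ratio(seq, "T", "A")


print(cpg_score("CGGAGTACTGTCCTCCG"), tpa_score("CGGAGTACTGTCCTCCG"))
```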
The minimum free energy of the mRNA secondary structure was predicted for each alternative sequence using the software Mfold or ViennaRNA.
Microsatellite sequence features were counted and scored for each alternative sequence. For microsatellite regions that are repeated consecutively at least 3 times and whose microsatellite sequence length is less than the sliding window length used for homology risk scoring, the length and frequency of the longest microsatellite region were counted and compared, with length taking priority over frequency; the higher the priority and the larger the value, the higher the microsatellite instability.
The alternative sequences were ranked by these scores, the top 50 were retained, and sequences were selected from among them for experimental verification of biological function and engineered adenovirus genome stability.
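The microsatellite statistic can be sketched with a regular expression that finds tandem repeats. Two readings in this sketch are assumptions: the repeat unit is limited to 1-6 bp (a typical microsatellite unit size, not stated in the text), and "microsatellite sequence length less than the sliding window length" is read as the total length of the repeated region. The name `microsatellite_score` and the example sequence are hypothetical.

```python
import re

# A unit of 1-6 bases followed by at least two more copies (>= 3 copies in total).
REPEAT = re.compile(r"([ACGT]{1,6}?)\1{2,}")


def microsatellite_score(seq, window_len=12):
    """Return (length, frequency) of the longest qualifying tandem-repeat region."""
    lengths = [
        len(m.group(0))
        for m in REPEAT.finditer(seq)
        if len(m.group(0)) < window_len
    ]
    if not lengths:
        return (0, 0)
    longest = max(lengths)
    return (longest, lengths.count(longest))


# Hypothetical example: an (AT)x3 region and a (T)x6 region, both 6 bp long.
print(microsatellite_score("CCATATATGGTTTTTTCC"))
```

Returning a `(length, frequency)` tuple keeps the "length first, then frequency" comparison a plain tuple comparison. Regions at or above the window length are left out on the assumption that the homology risk score already covers them.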
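The text does not state how the individual scores are combined into one ranking. One simple possibility, shown here purely as an assumption, is a lexicographic sort in which the homology risk tuple comes first, the microsatellite tuple second, and a codon pair bias score (assumed here to be better when higher) breaks remaining ties. The `Candidate` fields and all values below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    homology: tuple         # (longest identical length, frequency); lower is better
    microsatellite: tuple   # (longest region length, frequency); lower is better
    codon_pair_bias: float  # assumed: higher is better


def rank(candidates, top_n=50):
    """Sort by ascending risk tuples, then by descending codon pair bias."""
    def key(c):
        return (c.homology, c.microsatellite, -c.codon_pair_bias)
    return sorted(candidates, key=key)[:top_n]


pool = [
    Candidate("cand_a", (16, 1), (6, 1), 0.10),
    Candidate("cand_b", (13, 2), (6, 1), 0.05),
    Candidate("cand_c", (13, 1), (9, 1), 0.02),
]
print([c.name for c in rank(pool, top_n=2)])
```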
4) For sequences located in other regulatory regions, homologous subtype sequences with similar biological functions were obtained by literature mining; conserved sequence regions were identified using sequence similarity search software (such as BLAST); further alternative homologous subtype sequences were obtained by random mutation, truncation and the like of the non-conserved regions, and random sequences containing restriction enzyme site sequences, homologous cloning sequences or barcode sequences that would interfere with sequence cloning and vector construction were filtered out.
5) The alternative sequences were each given the homology risk score of step 1) and a microsatellite instability score and were ranked, and sequences were selected from among them for experimental verification of biological function and engineered adenovirus genome stability.
6) Finally, homology-optimized sequences that retain similar function and give viruses that remain stable over passaging were selected.
A schematic flow diagram according to an embodiment of the invention is shown in fig. 1.
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings. The embodiments described below with reference to the drawings are exemplary, are intended to illustrate the invention, and are not to be construed as limiting the invention.
Gal4/UAS is a gene expression control system found in yeast. The binding domain of the transcription regulator Gal4 binds to UAS sequences (transcription factor binding sites upstream of the promoter), while its activation domain activates the downstream promoter. Multiple UAS in tandem effectively enhance the binding capacity of Gal4 and thus the activation strength of the downstream promoter. In viral vectors, however, the homologous sequences of tandem UAS may cause recombination and give rise to non-functional variants.
The inventors therefore performed homologous sequence optimization for UAS.
Example 2 UAS homologous sequence optimization
Experiment one
Detecting the downstream promoter expression strength of 4x UAS and 5x UAS homologous subtypes
The sequence pattern of the Gal4 protein binding site, obtained by bioinformatic analysis, is shown in FIG. 2. With reference to different wild-type Gal4 protein binding sites, homologous subtype replacement was performed on the non-conserved region and the homology risk was scored, giving the series of UAS homologous subtype sequences shown in SEQ ID NO: 1-37.
>A1
CGGAGTACTGTCCTCCG(SEQ ID NO:1)。
>A2
CGGAGGACTGTCCTCCG(SEQ ID NO:2)。
>A3
CGGAGAACTGTTCTCCG(SEQ ID NO:3)。
>A4
CGGAGCACTGTGCTCCG(SEQ ID NO:4)。
>A5
CGGACTACTGTAGTCCG(SEQ ID NO:5)。
>A6
CGGACGACTGTCGTCCG(SEQ ID NO:6)。
>A7
CGGACAACTGTTGTCCG(SEQ ID NO:7)。
>A8
CGGACCACTGTGGTCCG(SEQ ID NO:8)。
>A9
CGGAATACTGTATTCCG(SEQ ID NO:9)。
>A10
CGGAAGACTGTCTTCCG(SEQ ID NO:10)。
>A11
CGGAAAACTGTTTTCCG(SEQ ID NO:11)。
>A12
CGGAACACTGTGTTCCG(SEQ ID NO:12)。
>A13
CGGATTACTGTAATCCG(SEQ ID NO:13)。
>A14
CGGATGACTGTCATCCG(SEQ ID NO:14)。
>A15
CGGATAACTGTTATCCG(SEQ ID NO:15)。
>A16
CGGATCACTGTGATCCG(SEQ ID NO:16)。
>B1
CGGGGTGCCGCCCCCCG(SEQ ID NO:17)。
>B2
CGGGATGCCGCATCCCG(SEQ ID NO:18)。
>B3
CGGGAAGCCGCTTCCCG(SEQ ID NO:19)。
>B4
CGGGTTGCCGCAACCCG(SEQ ID NO:20)。
>B5
CGGGTAGCCGCTACCCG(SEQ ID NO:21)。
>B6
CGGGCCGCCGCGTCCCG(SEQ ID NO:22)。
>C1
CGGAGTGCCGCACTCCG(SEQ ID NO:23)。
>C2
CGGGGTACTGTACCCCG(SEQ ID NO:24)。
>C3
CGGAGTGCTGTACTCCG(SEQ ID NO:25)。
>C4
CGGAGTACCGTACTCCG(SEQ ID NO:26)。
>C5
CGGGGTGCTGCACCCCG(SEQ ID NO:27)。
>C6
CGGGGTACCGTACCCCG(SEQ ID NO:28)。
>C7
CGGGGTGCCGCACCCCG(SEQ ID NO:29)。
>D1
CGGTCCACTGTGTGCCG(SEQ ID NO:30)。
>D2
CGGGTGACAGCCCTCCG(SEQ ID NO:31)。
>D3
CGGCCATATGTCTTCCG(SEQ ID NO:32)。
>D4
CGGCGGTCTTTCGTCCG(SEQ ID NO:33)。
>D5
CGGGTGACCGCCCTCCG(SEQ ID NO:34)。
>D6
CGGTCCACAGTGTGCCG(SEQ ID NO:35)。
>D7
CGGCCATATCGCTTCCG(SEQ ID NO:36)。
>D8
CGGGTGACAGCCCTCCG(SEQ ID NO:37)。
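The conserved and non-conserved positions of these 17-bp subtypes can be recovered with a simple column comparison, and new variants can then be drawn by mutating only the non-conserved positions. The sketch below uses seven of the listed subtypes (A1, A5, B1, C1, D1, D3, D4). It assumes the sequences are already aligned at equal length, it illustrates the random-mutation route of Example 1 rather than the subtype replacement actually used for this list, and the helper names are hypothetical.

```python
import random

SUBTYPES = [
    "CGGAGTACTGTCCTCCG",  # A1, SEQ ID NO: 1
    "CGGACTACTGTAGTCCG",  # A5, SEQ ID NO: 5
    "CGGGGTGCCGCCCCCCG",  # B1, SEQ ID NO: 17
    "CGGAGTGCCGCACTCCG",  # C1, SEQ ID NO: 23
    "CGGTCCACTGTGTGCCG",  # D1, SEQ ID NO: 30
    "CGGCCATATGTCTTCCG",  # D3, SEQ ID NO: 32
    "CGGCGGTCTTTCGTCCG",  # D4, SEQ ID NO: 33
]


def conserved_columns(seqs):
    """Indices at which every aligned, equal-length sequence carries the same base."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i in range(length) if len({s[i] for s in seqs}) == 1]


def consensus(seqs):
    """Keep conserved bases and write N at every non-conserved position."""
    keep = set(conserved_columns(seqs))
    return "".join(b if i in keep else "N" for i, b in enumerate(seqs[0]))


def mutate_nonconserved(seq, conserved, rng, n_mut=2):
    """Substitute n_mut randomly chosen non-conserved positions with a different base."""
    fixed = set(conserved)
    free = [i for i in range(len(seq)) if i not in fixed]
    bases = list(seq)
    for i in rng.sample(free, n_mut):
        bases[i] = rng.choice([b for b in "ACGT" if b != bases[i]])
    return "".join(bases)


fixed_cols = conserved_columns(SUBTYPES)
print(consensus(SUBTYPES))
print(mutate_nonconserved(SUBTYPES[0], fixed_cols, random.Random(1)))
```

For this subset only the flanking CGG and CCG triplets are conserved, so the consensus reads as CGG, eleven variable positions, then CCG.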
UAS homologous subtypes A1, C3 and D1-D5 were used to construct 4x promoters. Target cells were transfected separately to test reporter gene expression driven by the original promoter and by each subtype promoter; expression was normalized to that of the original promoter to give the relative expression strength, and the expression difference between each subtype and the original promoter was compared.
UAS homologous subtypes A2-A16, B1-B5, C1-C6 and D1-D2 were used to construct 5x promoters. Target cells were transfected separately to test reporter gene expression driven by the original promoter and by each subtype promoter; expression was normalized to that of the original promoter to give the relative expression strength, and the expression difference between each subtype and the original promoter was compared.
The detection results are shown in FIG. 3. Among the homologous subtypes, A1, A4, A5, A6, A14, B1, C1, C3, D2 and D5 showed higher relative expression strength and can be used to construct combined UAS promoters.
Experiment two
Detecting the downstream promoter expression strength of 4x UAS and 5x UAS combinations of different homologous subtypes
Based on the results of Experiment one and the homology between the sequences, 4x UAS and 5x UAS combined promoters were constructed as shown in Table 2, and two different inhibition sites were added. Target cells were transfected separately to test reporter gene expression driven by the original promoter and by each subtype combined promoter; expression was normalized to that of the original promoter to give the relative expression strength, and the expression difference between each subtype combined promoter and the original promoter was compared.
Table 2:
Combined promoter    UAS homologous subtypes
Combination A        A5-A1-A6-A4
Combination B        B1-A5-A6-A4
Combination C        B1-D5-D2-C1
Combination D        D5-A1-D2-C1
Combination E        A5-A1-A6-A4-A14
Combination F        B1-A5-A6-A4-A14
Combination G        B1-D5-D2-C1-C3
Combination H        D5-A1-D2-C1-C3
>A5-A1-A6-A4
CGGACTACTGTAGTCCGCGGAGTACTGTCCTCCGCGGACGACTGTCGTCCGCGGAGCACTGTGCTCCG(SEQ ID NO:38)。
>B1-A5-A6-A4
CGGGGTGCCGCCCCCCGCGGACTACTGTAGTCCGCGGACGACTGTCGTCCGCGGAGCACTGTGCTCCG(SEQ ID NO:39)。
>B1-D5-D2-C1
CGGGGTGCCGCCCCCCGCGGGTGACCGCCCTCCGCGGGTGACAGCCCTCCGCGGAGTGCCGCACTCCG(SEQ ID NO:40)。
>D5-A1-D2-C1
CGGGTGACCGCCCTCCGCGGAGTACTGTCCTCCGCGGGTGACAGCCCTCCGCGGAGTGCCGCACTCCG(SEQ ID NO:41)。
>A5-A1-A6-A4-A14
CGGACTACTGTAGTCCGCGGAGTACTGTCCTCCGCGGACGACTGTCGTCCGCGGAGCACTGTGCTCCGCGGATGACTGTCATCCG(SEQ ID NO:42)。
>B1-A5-A6-A4-A14
CGGGGTGCCGCCCCCCGCGGACTACTGTAGTCCGCGGACGACTGTCGTCCGCGGAGCACTGTGCTCCGCGGATGACTGTCATCCG(SEQ ID NO:43)。
>B1-D5-D2-C1-C3
CGGGGTGCCGCCCCCCGCGGGTGACCGCCCTCCGCGGGTGACAGCCCTCCGCGGAGTGCCGCACTCCGCGGAGTGCTGTACTCCG(SEQ ID NO:44)。
>D5-A1-D2-C1-C3
CGGGTGACCGCCCTCCGCGGAGTACTGTCCTCCGCGGGTGACAGCCCTCCGCGGAGTGCCGCACTCCGCGGAGTGCTGTACTCCG(SEQ ID NO:45)。
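As a worked check on the combined promoters, the sketch below assembles Combination A from its four subtypes, confirms that it reproduces SEQ ID NO: 38, and compares its longest internal repeat with that of four identical tandem copies. The tandem of A1 is used only as an illustration of a fully repeated promoter and is not claimed to be the original promoter of the experiments; `assemble` and `longest_repeat` are hypothetical helper names, and the repeat search is a brute-force scan suited to short sequences.

```python
UAS = {
    "A1": "CGGAGTACTGTCCTCCG",  # SEQ ID NO: 1
    "A4": "CGGAGCACTGTGCTCCG",  # SEQ ID NO: 4
    "A5": "CGGACTACTGTAGTCCG",  # SEQ ID NO: 5
    "A6": "CGGACGACTGTCGTCCG",  # SEQ ID NO: 6
}
SEQ_ID_38 = (
    "CGGACTACTGTAGTCCGCGGAGTACTGTCCTCCG"
    "CGGACGACTGTCGTCCGCGGAGCACTGTGCTCCG"
)


def assemble(name):
    """Concatenate subtypes named like 'A5-A1-A6-A4' into one tandem promoter."""
    return "".join(UAS[part] for part in name.split("-"))


def longest_repeat(seq):
    """Length of the longest substring that starts at two different positions."""
    best = 0
    n = len(seq)
    for i in range(n):
        for j in range(i + 1, n):
            k = 0
            while j + k < n and seq[i + k] == seq[j + k]:
                k += 1
            best = max(best, k)
    return best


combo_a = assemble("A5-A1-A6-A4")
identical = UAS["A1"] * 4
print(combo_a == SEQ_ID_38, longest_repeat(identical), longest_repeat(combo_a))
```

The identical tandem contains a 51-bp internal repeat, whereas the mixed combination stays below the 12 bp minimum window length used for homology risk scoring.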
The combined promoter pairs carrying different inhibition sites were used to drive expression of the corresponding inhibitors, so as to construct a mutual-inhibition switch-type gene circuit (for the construction method, see patent PCT/CN2017/096043). Target cells were transfected separately to measure the fold difference in reporter gene expression between the switch states of the circuit for the original promoter pair and for each subtype combined promoter pair, and these fold differences were compared.
The detection results are shown in FIGS. 4 to 5. Among the homologous subtype combined promoters, Combination B, Combination D, Combination E and Combination G showed higher expression strength and were used to construct the mutual-inhibition switch-type gene circuits. Compared with the switch-type gene circuit built on the original promoters, the fold difference in expression between the switch states remained above 10-fold for all four constructed combined promoter pairs.
Experiment three
Detection of differences in function and stability of engineered viruses before and after promoter optimization
The engineered virus carrying the switch-type gene circuit with the original promoters was passaged for 15 generations, and variation was detected by deep sequencing at a sequencing depth of 1G, with a variation rate detection limit of 0.001. The engineered virus carrying the switch-type gene circuit with the Combination B-Combination D promoter pair was passaged and sequenced under the same conditions (15 generations, sequencing depth 1G, detection limit 0.001). Target cells were infected with equal amounts of the two engineered viruses, and the functionally effective content of the target gene product driven by the promoter was measured and normalized to the level before promoter optimization to give the relative expression strength. Promoter expression strength and viral genome stability were then compared before and after sequence homology optimization.
The detection results are shown in FIG. 6. After promoter sequence optimization, the promoter-derived variation rate of the virus at generation 15 fell from 0.1 to an undetectable level, and the effective content of the expressed target gene showed no significant difference from that before optimization. Homologous sequence optimization therefore effectively reduced the homology risk while maintaining the expression strength of the promoter.
Programmed death receptor 1 (PD-1) and its ligand (PD-L1) are important immune checkpoint molecules. Binding of PD-1 on the T cell surface to its ligand PD-L1 (usually on the tumor surface) inhibits T cell activation and proliferation and induces T cell apoptosis, which is an important mechanism of immunosuppression in the tumor microenvironment. Using a single-chain antibody (scFv) to bind PD-1 or PD-L1 and so prevent the two from binding can effectively enhance the immune effect in the tumor microenvironment; the scFv can be used alone as a small-molecule targeted drug or loaded onto other targeted drug carriers. In viral vectors, homologous sequences in the scFv may cause recombination variation and loss of function. In this application, the single-chain antibody targeting PD-1 is referred to as Ni-scFv and the one targeting PD-L1 as At-scFv.
The inventors therefore performed homologous sequence optimization for the Ni-scFv gene and the At-scFv gene.
Example 3 Homologous sequence optimization of target genes
The At-scFv gene targeting PD-L1 and the Ni-scFv gene targeting PD-1 were each subjected to homologous sequence optimization.
Experiment one
Ni-scFv gene homologous sequence optimization
The Ni-scFv gene homologous subtype sequences obtained by synonymous codon replacement according to the method of Example 1 are shown in SEQ ID NO: 46-48.
>Ni-scFv-a
ATGTCCGTTCCAACCCAGGTTCTCGGCCTCCTACTGCTGTGGCTCACCGATGCCAGATGCCAGGTGCAATTGGTAGAGTCCGGGGGCGGGGTGGTGCAGCCAGGGCGATCTTTGCGGTTAGACTGCAAGGCATCCGGAATAACCTTTTCTAATAGCGGGATGCATTGGGTGAGACAAGCACCCGGAAAAGGCCTTGAATGGGTAGCAGTCATATGGTATGACGGTAGCAAACGCTACTATGCAGACTCGGTCAAGGGTCGATTCACTATTAGTCGCGATAACAGCAAGAATACCTTGTTCCTGCAGATGAACTCACTACGGGCTGAAGACACAGCTGTCTATTACTGTGCAACCAACGACGATTATTGGGGACAGGGGACCTTGGTCACCGTCAGTTCCGGGGGAGGGGGATCCGGAGGCGGCGGCTCAGGAGGAGGTGGCTCCGAGATCGTACTGACTCAGAGCCCGGCAACTTTGTCTCTGTCGCCTGGCGAGCGGGCCACACTGTCCTGTAGGGCCTCACAGAGCGTGAGTTCATATCTAGCTTGGTATCAACAGAAGCCGGGGCAGGCCCCTAGATTGCTAATCTACGATGCAAGTAACAGAGCAACTGGCATCCCCGCGAGATTTAGCGGATCCGGTTCCGGAACCGACTTTACACTCACTATCTCCTCCCTAGAACCAGAAGATTTTGCAGTCTACTATTGCCAACAGTCGTCCAACTGGCCTCGCACATTTGGGCAGGGAACCAAGGTTGAAATTAAA(SEQ ID NO:46)。
>Ni-scFv-b
ATGTCCGTTCCCACCCAGGTGCTCGGCCTCCTATTACTCTGGCTAACTGATGCCCGGTGTCAGGTGCAGCTCGTGGAGAGTGGAGGGGGTGTGGTTCAGCCGGGTCGGTCACTGCGGCTGGACTGTAAAGCTAGCGGTATCACGTTCAGTAACTCAGGTATGCACTGGGTACGGCAGGCCCCCGGCAAAGGCTTGGAGTGGGTTGCTGTGATATGGTACGATGGTTCTAAAAGGTATTATGCTGATTCCGTGAAGGGCCGGTTTACCATATCACGCGACAACTCCAAGAACACGCTCTTCCTCCAGATGAATTCACTCCGAGCGGAAGACACCGCGGTTTATTATTGCGCCACCAATGATGACTACTGGGGCCAGGGCACCTTGGTTACCGTGTCTAGCGGAGGGGGGGGCTCGGGGGGCGGCGGTAGCGGTGGAGGTGGGTCCGAGATCGTCCTCACGCAATCCCCCGCCACCCTTAGTCTCAGCCCTGGGGAGCGGGCAACCCTTAGCTGCCGGGCGTCCCAGTCAGTCAGTTCCTACCTTGCCTGGTACCAGCAAAAGCCCGGCCAGGCACCTCGCCTCCTTATTTATGATGCATCGAACCGAGCAACCGGAATTCCTGCGCGGTTCAGTGGTTCTGGTAGCGGGACCGACTTCACACTTACAATATCTAGCCTAGAACCAGAAGACTTCGCTGTCTACTACTGCCAACAGAGCTCGAACTGGCCTAGAACATTCGGGCAGGGCACCAAGGTAGAAATCAAA(SEQ ID NO:47)。
>Ni-scFv-c
ATGAGCGTGCCCACCCAGGTGCTGGGCCTGCTGCTGCTGTGGCTGACCGACGCCAGGTGCCAGGTGCAGCTGGTGGAGAGCGGCGGCGGCGTGGTGCAGCCCGGCAGGAGCCTGAGGCTGGACTGCAAGGCCAGCGGCATCACCTTCAGCAACAGCGGCATGCACTGGGTGAGGCAGGCCCCCGGCAAGGGCCTGGAGTGGGTGGCCGTGATCTGGTACGACGGCAGCAAGAGGTACTACGCCGACAGCGTGAAGGGCAGGTTCACCATCAGCAGGGACAACAGCAAGAACACCCTGTTCCTGCAGATGAACAGCCTGAGGGCCGAGGACACCGCCGTGTACTACTGCGCCACCAACGACGACTACTGGGGCCAGGGCACCCTGGTGACCGTGAGCAGCGGCGGCGGCGGCAGCGGCGGCGGCGGCAGCGGCGGCGGCGGCAGCGAGATCGTGCTGACCCAGAGCCCCGCCACCCTGAGCCTGAGCCCCGGCGAGAGGGCCACCCTGAGCTGCAGGGCCAGCCAGAGCGTGAGCAGCTACCTGGCCTGGTACCAGCAGAAGCCCGGCCAGGCCCCCAGGCTGCTGATCTACGACGCCAGCAACAGGGCCACCGGCATCCCCGCCAGGTTCAGCGGCAGCGGCAGCGGCACCGACTTCACCCTGACCATCAGCAGCCTGGAGCCCGAGGACTTCGCCGTGTACTACTGCCAGCAGAGCAGCAACTGGCCCAGGACCTTCGGCCAGGGCACCAAGGTGGAGATCAAG(SEQ ID NO:48)。
The engineered virus carrying the original target gene was passaged for 15 generations, and variation was detected by deep sequencing at a sequencing depth of 1G, with a variation rate detection limit of 0.001. The engineered virus carrying the homology-optimized target gene was passaged and sequenced under the same conditions (15 generations, sequencing depth 1G, detection limit 0.001). Target cells were infected with equal amounts of the two engineered viruses, and the functionally effective content of the target gene product was measured and normalized to the level before target gene optimization to give the relative effective content. Target gene function and viral genome stability were then compared before and after sequence homology optimization.
The results are shown in Table 3 and FIG. 7. Before optimization of the target gene sequence, the longest mismatch-free homologous region with its own sequence was 16 bp, which is a high homology risk region, and a variation rate of 0.01 was detected after 15 generations of passage. The longest mismatch-free homologous region with the viral backbone sequence was 14 bp, which does not reach high homology risk, and no variation was detected after 15 generations of passage. For the preferred Ni-scFv gene homologous subtype sequences Ni-scFv-a and Ni-scFv-b obtained by optimizing the target gene sequence, the length of the longest mismatch-free homologous region was reduced, the variation rate at generation 15 fell from 0.01 to an undetectable level, and the effective content showed no significant difference from that before optimization. Homologous sequence optimization therefore effectively reduced the homology risk while maintaining the function of the target gene.
Table 3:
Figure BDA0002421780560000171
Experiment two
The At-scFv gene homologous subtype sequences obtained by synonymous codon replacement according to the method of Example 1 are shown in SEQ ID NO: 49-51.
>At-scFv-a
ATGTCCGTCCCTACTCAGGTACTAGGCCTCCTCTTGCTATGGCTGACCGACGCTAGATGCGAAGTGCAACTGGTCGAATCCGGCGGTGGCTTGGTCCAGCCAGGCGGATCATTACGCCTGTCTTGTGCAGCATCAGGCTTCACCTTTAGTGACAGTTGGATCCATTGGGTCCGGCAAGCCCCAGGCAAAGGGCTGGAATGGGTCGCCTGGATTAGCCCATATGGCGGCAGCACCTATTACGCCGACAGCGTCAAGGGGCGCTTTACCATTAGTGCTGACACAAGTAAGAATACCGCTTATCTGCAGATGAATAGCCTGCGGGCCGAAGACACGGCTGTTTACTACTGTGCACGACGCCACTGGCCGGGCGGTTTTGACTATTGGGGACAAGGAACTCTCGTGACAGTCTCGTCGGGAGGCGGCGGTTCAGGTGGCGGCGGCTCAGGAGGGGGGGGTTCTGATATACAGATGACACAATCCCCTTCTTCTCTGAGCGCAAGTGTGGGCGATCGGGTAACCATCACCTGTCGGGCTTCCCAGGACGTGAGTACAGCCGTGGCTTGGTATCAACAGAAGCCAGGCAAGGCCCCGAAGCTGCTAATCTACAGCGCTAGTTTCCTGTACTCAGGGGTGCCGAGCCGCTTTAGCGGAAGTGGATCTGGTACTGACTTCACACTTACTATCAGTTCTCTACAGCCGGAAGACTTCGCTACATATTACTGCCAGCAGTATCTCTATCACCCCGCTACTTTTGGACAAGGAACAAAAGTTGAGATCAAG(SEQ ID NO:49)。
>At-scFv-b
ATGTCAGTCCCCACCCAGGTCCTTGGACTACTCCTTCTATGGTTGACAGATGCCCGCTGCGAAGTCCAGTTGGTGGAATCCGGGGGCGGCCTTGTACAGCCCGGGGGGAGCCTGAGACTCAGCTGTGCCGCTTCAGGATTTACCTTCTCTGATTCCTGGATACATTGGGTACGTCAGGCCCCTGGGAAGGGATTGGAGTGGGTGGCCTGGATCAGTCCATACGGTGGCTCTACGTACTATGCGGACAGCGTCAAAGGGCGCTTTACTATTAGTGCAGATACATCGAAGAATACAGCCTACCTGCAGATGAATTCATTGAGGGCAGAGGACACTGCCGTCTATTACTGTGCAAGAAGGCACTGGCCCGGCGGCTTCGACTATTGGGGACAGGGCACCCTGGTCACAGTATCTTCAGGCGGAGGAGGTTCCGGGGGCGGCGGCTCCGGCGGTGGAGGCTCAGATATCCAGATGACACAGAGCCCCAGCTCTTTATCAGCTTCAGTGGGCGATCGGGTCACCATCACTTGTCGTGCCTCTCAGGATGTGTCTACCGCCGTGGCCTGGTATCAACAGAAACCCGGCAAGGCCCCAAAATTACTGATATATAGTGCAAGCTTCCTGTACTCAGGAGTCCCTTCACGCTTCTCCGGCTCGGGCAGTGGGACCGACTTTACTCTGACGATATCCAGTCTGCAGCCTGAAGATTTCGCTACCTACTACTGTCAACAGTACTTGTACCACCCCGCAACATTCGGACAGGGGACCAAAGTAGAAATTAAA(SEQ ID NO:50)。
>At-scFv-c
ATGAGCGTGCCCACCCAGGTGCTGGGCCTGCTGCTGCTGTGGCTGACCGACGCCAGGTGCGAGGTGCAGCTGGTGGAGAGCGGCGGCGGCCTGGTGCAGCCCGGCGGCAGCCTGAGGCTGAGCTGCGCCGCCAGCGGCTTCACCTTCAGCGACAGCTGGATCCACTGGGTGAGGCAGGCCCCCGGCAAGGGCCTGGAGTGGGTGGCCTGGATCAGCCCCTACGGCGGCAGCACCTACTACGCCGACAGCGTGAAGGGCAGGTTCACCATCAGCGCCGACACCAGCAAGAACACCGCCTACCTGCAGATGAACAGCCTGAGGGCCGAGGACACCGCCGTGTACTACTGCGCCAGGAGGCACTGGCCCGGCGGCTTCGACTACTGGGGCCAGGGCACCCTGGTGACCGTGAGCAGCGGCGGCGGCGGCAGCGGCGGCGGCGGCAGCGGCGGCGGCGGCAGCGACATCCAGATGACCCAGAGCCCCAGCAGCCTGAGCGCCAGCGTGGGCGACAGGGTGACCATCACCTGCAGGGCCAGCCAGGACGTGAGCACCGCCGTGGCCTGGTACCAGCAGAAGCCCGGCAAGGCCCCCAAGCTGCTGATCTACAGCGCCAGCTTCCTGTACAGCGGCGTGCCCAGCAGGTTCAGCGGCAGCGGCAGCGGCACCGACTTCACCCTGACCATCAGCAGCCTGCAGCCCGAGGACTTCGCCACCTACTACTGCCAGCAGTACCTGTACCACCCCGCCACCTTCGGCCAGGGCACCAAGGTGGAGATCAAG(SEQ ID NO:51)。
The engineered virus carrying the original target gene was passaged for 15 generations, and variation was detected by deep sequencing at a sequencing depth of 1G, with a variation rate detection limit of 0.001. The engineered virus carrying the homology-optimized target gene was passaged and sequenced under the same conditions (15 generations, sequencing depth 1G, detection limit 0.001). Target cells were infected with equal amounts of the two engineered viruses, and the functionally effective content of the target gene product was measured and normalized to the level before target gene optimization to give the relative effective content. Target gene function and viral genome stability were then compared before and after sequence homology optimization.
For the preferred At-scFv gene homologous subtype sequences At-scFv-a and At-scFv-b obtained by optimizing the target gene sequence, the length of the longest mismatch-free homologous region was reduced, the variation rate at generation 15 was reduced, and the effective content showed no significant difference from that before optimization. Homologous sequence optimization therefore effectively reduced the homology risk while maintaining the function of the target gene.
In the description herein, reference to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, such terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the various embodiments or examples described in this specification, and the features of different embodiments or examples, provided they do not contradict one another.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (21)

1. A method of screening for a candidate insert for engineering a test sample nucleic acid sequence, the method comprising:
(1) determining an initial gene coding region sequence based on a candidate gene for modifying a nucleic acid sequence of a sample to be detected;
(2) performing synonymous codon replacement on the initial gene coding region sequence so as to obtain a candidate coding region sequence set consisting of a plurality of candidate coding region sequences;
(3) comparing the candidate coding region sequence with itself, a host genome and a viral genome and determining a homology high risk region for homology risk scoring of the candidate coding region sequence;
(4) determining a subset of preferred coding region sequences among the set of candidate coding region sequences based on the homology risk scores.
2. The method of claim 1, further comprising:
performing codon pair bias scoring on the candidate coding region sequence based on codon frequency in the host; or
performing codon bias scoring on the candidate coding region sequence based on codon frequency in the host; or
performing CpG scoring on the candidate coding region sequence based on the C base frequency, the G base frequency and the CpG sequence frequency in the candidate coding region sequence; or
performing TpA scoring on the candidate coding region sequence based on the A base frequency, the T base frequency and the TpA sequence frequency in the candidate coding region sequence; or
predicting a minimum free energy of RNA secondary structure based on the mRNA sequences encoded by the candidate coding region sequences; or
performing microsatellite instability scoring on the candidate coding region sequence based on microsatellite sequences in the candidate coding region sequence;
optionally, said step (4) further comprises determining a subset of preferred coding region sequences in said set of candidate coding region sequences based on said homology risk score and at least one of the following:
said codon pair bias score, said codon bias score, said CpG score, said TpA score, said mRNA secondary structure minimum free energy, and said microsatellite instability score.
3. The method of claim 1, wherein in step (3), a sliding window sequence alignment method is adopted, and the length of the sliding window is 12-17 bp.
4. The method of claim 2, wherein the codon pair bias score is determined based on the following formula:
Figure FDA0002421780550000011
5. The method of claim 2, wherein the codon preference score is determined based on the following formula:
Figure FDA0002421780550000021
6. The method of claim 2, wherein the CpG score is determined based on the following formula:
Figure FDA0002421780550000022
7. The method of claim 2, wherein the TpA score is determined based on the following formula:
Figure FDA0002421780550000023
8. The method of claim 2, wherein the RNA secondary structure minimum free energy prediction is performed using the software Mfold or ViennaRNA.
9. The method of claim 2, wherein the microsatellite instability score is determined based on:
for the microsatellite region which is continuously repeated for at least 3 times and the length of the microsatellite sequence is less than the length of the sliding window of the homology risk score, counting the length and frequency of the longest microsatellite region, comparing the length and frequency of the longest microsatellite region from high to low according to priority, wherein the higher the priority and the larger the value, the higher the instability of the microsatellite is.
10. The method of claim 1, wherein in step (1), an initial non-coding region sequence is further determined, and further comprising:
(5) obtaining a first homologous subtype sequence of the initial non-coding region sequence;
(6) determining respective conserved regions and non-conserved regions based on the initial non-coding region sequence and the first homologous subtype sequence;
(7) obtaining a set of candidate non-coding region sequences consisting of a plurality of candidate non-coding region sequences by at least one of random mutagenesis and truncation of the non-conserved region;
(8) performing said homology risk scoring and optionally microsatellite instability scoring on said non-coding region sequences to determine a preferred subset of non-coding region sequences among said set of candidate non-coding region sequences.
11. A method of screening for a candidate insert for engineering a test sample nucleic acid sequence, the method comprising:
(a) determining an initial non-coding region sequence based on a candidate gene for engineering the test sample nucleic acid sequence;
(b) obtaining a first homologous subtype sequence of the initial non-coding region sequence;
(c) determining respective conserved regions and non-conserved regions based on the initial non-coding region sequence and the first homologous subtype sequence;
(d) obtaining a set of candidate non-coding region sequences consisting of a plurality of candidate non-coding region sequences by at least one of random mutagenesis and truncation of the non-conserved region;
(e) performing homology risk scoring and optionally microsatellite instability scoring on said candidate non-coding region sequences; and
(f) determining a preferred subset of non-coding region sequences among the set of candidate non-coding region sequences based on the homology risk score and optionally the microsatellite instability score.
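Steps (c) and (d) can be illustrated with a minimal sketch. It assumes the initial sequence and its homologous subtype sequences are already aligned to equal length, treats a column as conserved only when all sequences agree, and applies point substitutions as the random mutagenesis. The function names and the `rate` parameter are hypothetical and are not taken from the patent.

```python
import random


def conserved_mask(aligned):
    """True where every pre-aligned sequence has the same base."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be pre-aligned to equal length")
    return [len({s[i] for s in aligned}) == 1 for i in range(length)]


def mutate_non_conserved(seq, mask, rate=0.2, seed=0):
    """Randomly substitute bases at non-conserved positions only."""
    rng = random.Random(seed)
    out = []
    for base, conserved in zip(seq, mask):
        if not conserved and rng.random() < rate:
            # Pick any base other than the current one.
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)


initial = "ACGTACGT"
subtype = "ACGTTCGA"
mask = conserved_mask([initial, subtype])
candidates = [mutate_non_conserved(initial, mask, rate=0.5, seed=s) for s in range(5)]
```

Each candidate would then be passed to the homology risk scoring and, optionally, the microsatellite instability scoring of steps (e) and (f).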
12. The method of any one of claims 1 to 11, wherein the test sample nucleic acid sequence comprises a viral genome sequence.
13. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method of screening candidate inserts according to any one of claims 1 to 12.
14. An electronic device comprising a memory and a processor,
wherein the processor, by reading executable program code stored in the memory, runs a program corresponding to the executable program code so as to implement the method of screening candidate inserts according to any one of claims 1 to 12.
15. A system for screening candidate inserts for engineering a test sample nucleic acid sequence, the system comprising:
an initial gene coding region sequence determining device, which determines an initial gene coding region sequence based on a candidate gene for engineering the test sample nucleic acid sequence;
a candidate coding region sequence set determining device, which is connected to the initial gene coding region sequence determining device and is used for performing synonymous codon replacement on the initial gene coding region sequence to obtain a candidate coding region sequence set consisting of a plurality of candidate coding region sequences;
a homology risk scoring device, which is connected to the candidate coding region sequence set determining device and is used for comparing the candidate coding region sequences with a host genome and a virus genome and determining homology high-risk regions, so as to perform homology risk scoring on the candidate coding region sequences; and
optionally,
a codon pair bias scoring device, which is connected to the candidate coding region sequence set determining device and is used for performing codon pair bias scoring on the candidate coding region sequences based on the codon frequency in the host; or
a codon preference scoring device, which is connected to the candidate coding region sequence set determining device and is used for performing codon preference scoring on the candidate coding region sequences based on the codon frequency in the host; or
a CpG scoring device, which is connected to the candidate coding region sequence set determining device and is used for performing CpG scoring on the candidate coding region sequences based on the C base frequency, the G base frequency and the CpG sequence frequency in the candidate coding region sequences; or
a TpA scoring device, which is connected to the candidate coding region sequence set determining device and is used for performing TpA scoring on the candidate coding region sequences based on the A base frequency, the T base frequency and the TpA sequence frequency in the candidate coding region sequences; or
an mRNA secondary structure minimum free energy prediction device, which is connected to the candidate coding region sequence set determining device and is used for predicting the mRNA secondary structure minimum free energy of the candidate coding region sequences based on the mRNA sequences of the candidate coding region sequences; or
a microsatellite instability scoring device, which is connected to the candidate coding region sequence set determining device and is used for performing microsatellite instability scoring on the candidate coding region sequences based on the microsatellite sequences in the candidate coding region sequences; and
a preferred coding region sequence subset determining device, which is connected to the homology risk scoring device and optionally to the codon pair bias scoring device, the codon preference scoring device, the CpG scoring device, the TpA scoring device, the mRNA secondary structure minimum free energy prediction device and the microsatellite instability scoring device, and is used for determining a preferred coding region sequence subset in the candidate coding region sequence set based on the homology risk score and optionally the codon pair bias score, the codon preference score, the CpG score, the TpA score, the mRNA secondary structure minimum free energy and the microsatellite instability score.
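The synonymous codon replacement performed by the candidate coding region sequence set determining device can be sketched as follows. This is a minimal illustration that assumes the standard genetic code and uniform random choice among synonymous codons; the function names, the `seed` argument and the `max_attempts` bound are hypothetical and do not come from the patent.

```python
import itertools
import random

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO)
}
AA_TO_CODONS = {}
for _codon, _aa in CODON_TO_AA.items():
    AA_TO_CODONS.setdefault(_aa, []).append(_codon)


def translate(cds):
    """Translate a coding sequence whose length is a multiple of 3."""
    return "".join(CODON_TO_AA[cds[i:i + 3]] for i in range(0, len(cds), 3))


def synonymous_variants(cds, n, seed=0, max_attempts=1000):
    """Return up to n distinct sequences encoding the same protein as cds."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    cds = cds.upper()
    rng = random.Random(seed)
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    variants = set()
    for _ in range(max_attempts):
        if len(variants) >= n:
            break
        # Replace every codon with a randomly chosen synonymous codon.
        variant = "".join(
            rng.choice(AA_TO_CODONS[CODON_TO_AA[c]]) for c in codons
        )
        if variant != cds:
            variants.add(variant)
    return sorted(variants)


candidate_set = synonymous_variants("ATGGCTCTGAAATAA", n=5, seed=1)
```

Every candidate keeps the encoded protein unchanged, so the downstream scoring devices only have to rank sequences that differ at the nucleotide level.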
16. A system for screening candidate inserts for engineering a test sample nucleic acid sequence, the system comprising:
an initial non-coding region sequence determining device;
a first homologous subtype sequence acquiring device, which is connected to the initial non-coding region sequence determining device and is used for acquiring a first homologous subtype sequence of the initial non-coding region sequence;
a conserved region and non-conserved region determining device, which is connected to the initial non-coding region sequence determining device and the first homologous subtype sequence acquiring device and is used for determining respective conserved regions and non-conserved regions based on the initial non-coding region sequence and the first homologous subtype sequence;
a candidate non-coding region sequence set determining device, which is connected to the conserved region and non-conserved region determining device and is used for obtaining a candidate non-coding region sequence set consisting of a plurality of candidate non-coding region sequences by performing at least one of random mutagenesis and truncation on the non-conserved region; and
a preferred non-coding region sequence subset determining device, which is connected to the candidate non-coding region sequence set determining device and is used for performing the homology risk scoring and optionally microsatellite instability scoring on the candidate non-coding region sequences, so as to determine a preferred non-coding region sequence subset in the candidate non-coding region sequence set.
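The claims do not spell out how the homology risk score is computed, beyond comparison with reference genomes and the use of a sliding window. The sketch below is therefore only an assumption: it slides a fixed-length window along a candidate and counts windows that match a reference sequence exactly. A real implementation would more likely use an alignment tool that tolerates mismatches. The function name and the `window` parameter are hypothetical.

```python
def homology_risk(candidate, references, window=20):
    """Count candidate windows that occur exactly in any reference.

    ASSUMPTION: exact matching on one strand only; returns the number
    of matching windows and their start positions in the candidate.
    """
    candidate = candidate.upper()
    reference_windows = set()
    for ref in references:
        ref = ref.upper()
        reference_windows.update(
            ref[i:i + window] for i in range(len(ref) - window + 1)
        )
    hits = [
        i
        for i in range(len(candidate) - window + 1)
        if candidate[i:i + window] in reference_windows
    ]
    return len(hits), hits


risk, positions = homology_risk("AAACCCGGGTTT", ["TTCCCGGGAA"], window=6)
```

Candidates with fewer matching windows would rank as lower risk under this simplified measure, and the reported positions mark where a high-risk region would need to be changed.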
17. A UAS sequence, characterized by having a nucleotide sequence shown in SEQ ID NO: 1, 4-6, 14, 17, 23, 25, 31 and 34.
18. A set of UAS combinations, comprising a series of UAS sequences,
the UAS sequences have nucleotide sequences shown as SEQ ID NO: 4-6 and 17; or
nucleotide sequences shown as SEQ ID NO: 1, 23, 31 and 34; or
nucleotide sequences shown as SEQ ID NO: 1, 4, 5, 6 and 14; or
nucleotide sequences shown as SEQ ID NO: 17, 23, 25, 31 and 34.
19. The UAS combination of claim 18, wherein the UAS combination has a UAS join order from 5 'end to 3' end as follows:
B1-A5-A6-A4, D5-A1-D2-C1, A5-A1-A6-A4-A14 or B1-D5-D2-C1-C3;
optionally, the UAS combination has the nucleotide sequence shown in SEQ ID NO: 39, 41, 42 and 44.
20. A Ni-scFv gene sequence, characterized by having a nucleotide sequence shown in SEQ ID NO: 46 or 47.
21. An At-scFv gene sequence, characterized by having a nucleotide sequence shown in SEQ ID NO: 49 or 50.
CN202010207885.8A 2020-03-23 2020-03-23 Method and system for screening candidate inserts Pending CN113436683A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010207885.8A CN113436683A (en) 2020-03-23 2020-03-23 Method and system for screening candidate inserts


Publications (1)

Publication Number Publication Date
CN113436683A true CN113436683A (en) 2021-09-24

Family

ID=77753299


Country Status (1)

Country Link
CN (1) CN113436683A (en)


Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1309716A (en) * 1998-07-14 2001-08-22 阿文蒂斯药物德国有限公司 Expression system contg. chimeric promoters with binding sites for recombinant transcription factors
CN1388247A (en) * 2001-05-30 2003-01-01 方炳良 Method of strengthening specific destination gene expression of cell
US20080113351A1 (en) * 2004-05-11 2008-05-15 Alphagen Co., Ltd. Polynucleotides for causing RNA interference and method for inhibiting gene expression using the same
CN104204220A (en) * 2011-12-31 2014-12-10 深圳华大基因医学有限公司 Method for detecting genetic variation
CN105886616A (en) * 2016-04-20 2016-08-24 广东省农业科学院农业生物基因研究中心 Efficient specific sgRNA recognition site guide sequence for pig gene editing and screening method thereof
CN108221058A (en) * 2017-12-29 2018-06-29 苏州金唯智生物科技有限公司 One boar full-length genome sgRNA libraries and its construction method and application
CN109416927A (en) * 2016-10-07 2019-03-01 Illumina公司 The system and method for secondary analysis for nucleotide sequencing data
WO2019169117A1 (en) * 2018-02-28 2019-09-06 Cornell University Detecting variant alleles in complex, repetitive sequences within whole genome sequencing data sets
WO2019213478A1 (en) * 2018-05-04 2019-11-07 Nanostring Technologies, Inc. Gene expression assay for measurement of dna mismatch repair deficiency
CN110691792A (en) * 2017-01-10 2020-01-14 朱诺治疗学股份有限公司 Epigenetic analysis of cell therapies and related methods


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
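HUANG HUIYA et al.: "Oncolytic adenovirus programmed by synthetic gene circuit for cancer immunotherapy", NATURE COMMUNICATIONS *
LIU RAN; YIN LIHONG; PU YUEPU; WANG YI; PAN ENCHUN: "Relationship between abnormal expression of mismatch repair genes and the incidence of esophageal cancer in Huai'an", Carcinogenesis, Teratogenesis & Mutagenesis (癌变.畸变.突变), no. 06 *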

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115602244A (en) * 2022-10-24 2023-01-13 哈尔滨工业大学(Cn) Genome variation detection method based on sequence alignment framework
CN115602244B (en) * 2022-10-24 2023-04-28 哈尔滨工业大学 Genome variation detection method based on sequence alignment skeleton


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination