CN111161800A - Method, system, storage medium, and electronic device for diagnosing sequence of gene vector - Google Patents


Info

Publication number
CN111161800A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911402569.XA
Other languages
Chinese (zh)
Other versions
CN111161800B (en)
Inventor
蓝田
岑文杰
钟怡然
谢宁
丘佳倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunzhou Biotechnology (Guangzhou) Co.,Ltd.
Original Assignee
Yunzhou Biosciences (guangzhou) Inc
Application filed by Yunzhou Biosciences (Guangzhou) Inc
Priority to CN201911402569.XA
Publication of CN111161800A
Application granted
Publication of CN111161800B
Legal status: Active

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a sequence diagnosis method and system for a gene vector, a computer storage medium, and an electronic device. The method comprises the following steps: S1, obtaining gene sequences of a plurality of gene vectors to obtain a gene sequence set; S2, randomly classifying the gene sequences of the gene sequence set according to the characteristics of the gene sequences to obtain a plurality of subsets; S3, training on the gene sequences of each subset separately to obtain a training model; S4, acquiring the gene sequence of a gene vector to be diagnosed and/or a target gene, and inputting the gene sequence into the training model; and S5, the training model setting corresponding labels for the gene vector to be diagnosed and/or the target gene according to the detection result. The sequence diagnosis method according to embodiments of the invention can effectively improve diagnosis efficiency and production efficiency, avoids discovering only after production has begun that a vector is difficult to make, too costly, or impossible to produce, and thereby reduces cost.

Description

Method, system, storage medium, and electronic device for diagnosing sequence of gene vector
Technical Field
The present invention relates to the field of gene diagnosis, and more particularly, to a method for sequence diagnosis of a gene vector, a system for sequence diagnosis of a gene vector, a computer storage medium, and an electronic device.
Background
With the continuous development of biotechnology, demand keeps growing for vectors, which are a basic material required by biological experiments. For vector manufacturers, as production volume increases, the diversity of customer-designed vectors makes it impossible to price each vector individually. During production it is often found that particular sequence characteristics make a vector harder to manufacture, so that production costs more than expected or cannot be completed at all. Changing the price or stopping production midway greatly harms the customer's service experience, reduces the efficiency of both parties, and causes unexpected losses for both buyer and seller.
In the existing approach, the sequence characteristics of a vector are assessed before the vector enters production (when the vector is designed). When a user requests a quote, staff make a preliminary judgment of the vector's price; if the vector is found to be difficult or impossible to complete, the producer is warned in advance, pricing and the production plan are adjusted, and the customer is informed of the specifics so that the problems above are avoided. However, because the number of vectors is large, judging them manually one by one is error-prone and labor-intensive, makes cost hard to control, and greatly reduces production efficiency.
Disclosure of Invention
In view of the above, the present invention provides a method for diagnosing a sequence of a gene vector, a system for diagnosing a sequence of a gene vector, a computer storage medium, and an electronic device, which can effectively improve the diagnosis efficiency and the production efficiency of a gene vector and reduce the cost.
In order to solve the above technical problems, in one aspect, the present invention provides a method for diagnosing a sequence of a gene vector, comprising the steps of: S1, obtaining gene sequences of a plurality of gene vectors to obtain a gene sequence set; S2, randomly classifying the gene sequences of the gene sequence set according to the characteristics of the gene sequences to obtain a plurality of subsets; S3, training on the gene sequences of each subset separately to obtain a training model; S4, acquiring the gene sequence of a gene vector to be diagnosed and/or a target gene, and inputting the gene sequence into the training model; and S5, the training model setting corresponding labels for the gene vector to be diagnosed and/or the target gene according to the detection result.
According to the gene vector sequence diagnosis method provided by the embodiment of the invention, a training model is obtained by collecting, classifying, and training on the gene sequences of gene vectors, and the training model is used to diagnose the gene vector to be diagnosed and/or the target gene directly. This effectively improves diagnosis efficiency and production efficiency, avoids discovering only after production has begun that a vector is difficult to make, too costly, or impossible to produce, and reduces cost.
According to some embodiments of the invention, in step S2, the characteristics of the gene sequence include: the GC content of the gene sequence, the number of repeated sequences in the gene sequence, the length of the gene sequence, whether the gene sequence contains non-ATCG characters, and whether the gene vector is a viral vector; the number of subsets is five.
According to some embodiments of the invention, step S3 includes: calculating the GC content of the promoter; judging whether the total GC content of the promoter is more than 70% or the GC content of a local fragment is more than 80%, if so, setting a first label for the gene sequence; and judging whether the total GC content of the promoter is less than 30% or the GC content of the local fragment is less than 20%, and if so, setting a second label for the gene sequence.
According to some embodiments of the invention, the local fragment is 180bp to 230 bp.
According to some embodiments of the invention, step S3 includes: judging whether the promoter has a repetitive sequence of more than 10 repeats or a single-base repeat of more than 20 consecutive bases, and if so, setting a third label for the gene sequence.
According to some embodiments of the invention, step S3 includes: calculating the length of the promoter; judging whether the length of the promoter is less than 100bp, if so, setting a fourth label for the gene sequence; and judging whether the length of the promoter is more than 77000bp, and if so, setting a fifth label for the gene sequence.
According to some embodiments of the invention, step S3 includes: and judging whether the gene sequence contains non-ATCG characters, if so, setting a sixth label for the gene sequence.
According to some embodiments of the invention, step S3 further comprises: judging the virus type of the gene vector to be diagnosed; if the gene vector to be diagnosed is a lentivirus vector, judging whether the sequence fragment between the Δ5' LTR and ΔU3/3' LTR elements is larger than 9200 bp; if the gene vector to be diagnosed is an adeno-associated virus vector, judging whether the sequence fragment between the 5' ITR and 3' ITR elements is larger than 4700 bp; if the gene vector to be diagnosed is an adenovirus vector, judging whether the sequence fragment between the 5' ITR and 3' ITR elements is larger than 38700 bp; if the gene vector to be diagnosed is a retrovirus MMLV vector, judging whether the sequence fragment between the 5' MoMuLV LTR and 3' MoMuLV LTR elements is larger than 8300 bp; if the gene vector to be diagnosed is a retrovirus MSCV vector, judging whether the sequence fragment between the MSCV 5' LTR and MSCV 3' LTR elements is larger than 8300 bp; if so, setting a seventh label for the gene vector to be diagnosed.
In a second aspect, the present invention provides a system for diagnosing a sequence of a gene vector, including: a gene sequence acquisition module capable of acquiring the gene sequence of a gene vector to be diagnosed and/or a target gene uploaded by a user; a data processing module capable of receiving the gene sequence acquired by the gene sequence acquisition module and judging the characteristics of the gene sequence to obtain a judgment result; and a label printing module that applies a corresponding label, according to the judgment result, to each gene sequence that requires one.
According to some embodiments of the invention, the data processing module is capable of calculating the GC content of the promoter, determining whether the total GC content of the promoter is greater than 70% or the GC content of the local fragment is greater than 80%, and if so, setting a first tag for the gene sequence; and judging whether the total GC content of the promoter is less than 30% or the GC content of the local fragment is less than 20%, and if so, setting a second label for the gene sequence.
According to some embodiments of the invention, the data processing module is capable of determining whether the promoter has a repeat sequence of 10 or more repeats or a repeat sequence of more than 20 consecutive single bases, and if so, setting a third tag for the gene sequence.
According to some embodiments of the present invention, the data processing module can calculate the length of the promoter, determine whether the length of the promoter is less than 100bp, and if so, set a fourth tag for the gene sequence; and judging whether the length of the promoter is more than 77000bp, and if so, setting a fifth label for the gene sequence.
According to some embodiments of the invention, the data processing module is capable of determining whether a gene sequence contains a non-ATCG character, and if so, setting a sixth tag for the gene sequence.
According to some embodiments of the invention, the data processing module is capable of determining the virus type of the gene vector to be diagnosed, and if the gene vector to be diagnosed is a lentivirus vector, judging whether the sequence fragment between the Δ5' LTR and ΔU3/3' LTR elements is larger than 9200 bp; if the gene vector to be diagnosed is an adeno-associated virus vector, judging whether the sequence fragment between the 5' ITR and 3' ITR elements is larger than 4700 bp; if the gene vector to be diagnosed is an adenovirus vector, judging whether the sequence fragment between the 5' ITR and 3' ITR elements is larger than 38700 bp; if the gene vector to be diagnosed is a retrovirus MMLV vector, judging whether the sequence fragment between the 5' MoMuLV LTR and 3' MoMuLV LTR elements is larger than 8300 bp; if the gene vector to be diagnosed is a retrovirus MSCV vector, judging whether the sequence fragment between the MSCV 5' LTR and MSCV 3' LTR elements is larger than 8300 bp; if so, setting a seventh label for the gene vector to be diagnosed.
In a third aspect, an embodiment of the present invention provides a computer storage medium including one or more computer instructions, which when executed implement the method according to the above embodiment.
An electronic device according to a fourth aspect of the present invention comprises a memory for storing one or more computer instructions and a processor; the processor is configured to invoke and execute the one or more computer instructions to implement the method according to any of the embodiments described above.
Drawings
FIG. 1 is a flowchart of a sequence diagnosis method of a gene vector according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a sequence diagnostic system of a gene vector according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an electronic device according to an embodiment of the invention.
Reference numerals:
a gene vector sequence diagnosis system 100;
a gene sequence acquisition module 10; a data processing module 20; a label printing module 30;
an electronic device 300;
a memory 310; an operating system 311; an application 312;
a processor 320; a network interface 330; an input device 340; a hard disk 350; a display device 360.
Detailed Description
The following detailed description of embodiments of the present invention will be made with reference to the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
The following first explains the related terms referred to in the present application.
Vector: a vector is a self-replicating DNA molecule that transfers a DNA fragment (the target gene) into a recipient cell in recombinant DNA technology. The three most commonly used types of vector are bacterial plasmids, bacteriophages, and animal and plant viruses. For example, an insulin gene can be introduced into E. coli by means of a plasmid into which the insulin gene fragment has been inserted; the plasmid carrying the inserted gene fragment is called a vector. The plasmid can replicate autonomously in the bacterium without affecting the host's original activity.
Vector construction: vector construction is one of the commonly used techniques in molecular biology research. It mainly comprises modifying the multiple cloning site (MCS) of an existing vector and modifying functional elements such as its promoter, enhancer, and selection marker. Constructing a vector by computer means designing a new vector by inserting a nucleic acid sequence into, or modifying the sequence of, a functional element to be changed within an existing vector backbone.
Promoter: a promoter is a DNA sequence that RNA polymerase recognizes and binds in order to initiate transcription. It contains the conserved sequences required for specific RNA polymerase binding and transcription initiation. Most promoters are located upstream of the transcription start site of a structural gene and are not themselves transcribed. Some promoters, however, such as tRNA promoters, are located downstream of the transcription start site, and these DNA sequences can be transcribed. The properties of promoters were originally identified through mutations that increase or decrease the transcription rate of a gene.
Target gene: the target gene (also called the gene of interest) is a specific gene that is studied or manipulated in an experiment. In gene cloning, the target gene is the gene to be isolated, purified, cloned, and transformed into an organism to confer a desired phenotypic trait, such as resistance to insects or herbicides.
The sequence diagnosis method of the gene vector according to the embodiment of the present invention is described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the sequence diagnosis method of a gene vector according to an embodiment of the present invention includes the steps of:
S1, obtaining gene sequences of a plurality of gene vectors to obtain a gene sequence set.
S2, randomly classifying the gene sequences of the gene sequence set according to the characteristics of the gene sequences to obtain a plurality of subsets.
S3, training on the gene sequences of each subset separately to obtain a training model.
S4, acquiring the gene sequence of the gene vector to be diagnosed and/or the target gene, and inputting the gene sequence into the training model.
S5, the training model setting corresponding labels for the gene vector to be diagnosed and/or the target gene according to the detection result.
In other words, before diagnosing a gene vector, the sequence diagnosis method according to the embodiment of the present invention first collects the gene sequences of existing gene vectors to form a data set, then randomly divides the data set into a plurality of corresponding subsets according to the different characteristics of the gene vectors, and then trains on the gene sequences of each subset according to set instructions to obtain the training model. The gene sequence of a vector to be diagnosed, or of a target gene provided by a user, is then acquired and diagnosed by the training model, and corresponding labels are set for the gene sequence according to the diagnosis result. Production personnel or other staff can then use the labels to judge comprehensively the repeated sequences of the vector or target gene, the difficulty of virus packaging, the production risk, the difficulty of vector production, the biological effectiveness of the vector, and so on.
Therefore, according to the gene vector sequence diagnosis method provided by the embodiment of the invention, a training model is obtained by collecting, classifying, and training on the gene sequences of gene vectors, and the gene vector to be diagnosed and/or the target gene is diagnosed directly with the training model before production. This effectively improves diagnosis efficiency and production efficiency, avoids discovering only after production has begun that a vector is difficult to make, too costly, or impossible to produce, and reduces cost.
According to an embodiment of the present invention, in step S2, the characteristics of the gene sequence include: the GC content of the gene sequence, the number of repeated sequences in the gene sequence, the length of the gene sequence, whether the gene sequence contains non-ATCG characters, and whether the gene vector is a viral vector; the number of subsets is five.
Step S3 includes different training procedures for different characteristics of the gene sequence. In some embodiments of the invention, step S3 includes:
calculating the GC content of the promoter;
judging whether the total GC content of the promoter is more than 70% or the GC content of a local fragment is more than 80%, and if so, setting a first label for the gene sequence;
and judging whether the total GC content of the promoter is less than 30% or the GC content of a local fragment is less than 20%, and if so, setting a second label for the gene sequence.
Preferably, the local fragment is 180 bp to 230 bp.
Specifically, in the actual diagnosis process, the GC contents of the promoter and the target gene are calculated first. If the total GC content of the promoter or target gene fragment is more than 70%, or the GC content of a local fragment (200 bp) is more than 80%, the vector is labeled Risk-High GC; if the total GC content of the promoter or target gene fragment is less than 30%, or the GC content of a local fragment (200 bp) is less than 20%, the vector is labeled Risk-Low GC; otherwise, no processing is performed.
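The GC check described above can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the function names are hypothetical, the local fragment is assumed to be a sliding 200 bp window (the text only gives a 180 bp to 230 bp range and a 200 bp example), and sequences shorter than one window are assumed to be judged on total GC content alone.

```python
from typing import List, Optional, Tuple

HIGH_TOTAL, HIGH_LOCAL = 0.70, 0.80  # "more than 70%" overall, "more than 80%" locally
LOW_TOTAL, LOW_LOCAL = 0.30, 0.20    # "less than 30%" overall, "less than 20%" locally
WINDOW_BP = 200                      # assumed local fragment size (200 bp in the example)


def gc_fraction(seq: str) -> float:
    """Fraction of G and C characters in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def local_gc_range(seq: str, window: int = WINDOW_BP) -> Optional[Tuple[float, float]]:
    """Lowest and highest GC fraction over all sliding windows, or None if seq is too short."""
    seq = seq.upper()
    if len(seq) < window:
        return None
    is_gc = [base in "GC" for base in seq]
    count = sum(is_gc[:window])
    lowest = highest = count
    for i in range(window, len(seq)):
        # Slide the window one base to the right and update the running count.
        count += is_gc[i] - is_gc[i - window]
        lowest, highest = min(lowest, count), max(highest, count)
    return lowest / window, highest / window


def gc_labels(seq: str) -> List[str]:
    """Return the GC-related risk labels for a promoter or target gene sequence."""
    if not seq:
        return []
    total = gc_fraction(seq)
    local = local_gc_range(seq)
    labels = []
    if total > HIGH_TOTAL or (local is not None and local[1] > HIGH_LOCAL):
        labels.append("Risk-High GC")
    if total < LOW_TOTAL or (local is not None and local[0] < LOW_LOCAL):
        labels.append("Risk-Low GC")
    return labels
```

Under these assumptions a sequence can receive both labels at once, for example when a GC-poor sequence contains one GC-rich 200 bp stretch; the text does not say whether the two labels are mutually exclusive.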
In some embodiments of the invention, step S3 includes: judging whether the promoter has a repetitive sequence of more than 10 repeats or a single-base repeat of more than 20 consecutive bases, and if so, setting a third label for the gene sequence.
Specifically, in the actual diagnosis process, it is judged whether the promoter or the target gene in the vector has a repetitive sequence of 10 or more repeats or a single-base repeat of 20 or more consecutive bases. If so, the vector is labeled Risk-Repeat; otherwise, no processing is performed.
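A possible reading of the repeat check is sketched below with regular expressions. The text does not define the repeat unit, so this sketch assumes that "10 or more repeats" means a tandem repeat of a short unit of 2 to 6 bases; that unit range, the function names, and the inclusive thresholds (taken from the "including 20" wording) are all assumptions rather than details from the source.

```python
import re
from typing import List

MIN_UNIT_REPEATS = 10     # "10 or more repeats" (unit length is an assumption, see below)
MIN_SINGLE_BASE_RUN = 20  # "20 or more" consecutive copies of a single base

# One base followed by at least 19 further copies of itself.
HOMOPOLYMER = re.compile(r"([ACGT])\1{%d,}" % (MIN_SINGLE_BASE_RUN - 1))
# Assumed unit of 2 to 6 bases followed by at least 9 further copies of itself.
TANDEM = re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (MIN_UNIT_REPEATS - 1))


def has_risky_repeat(seq: str) -> bool:
    """True if seq contains a long single-base run or a long short-unit tandem repeat."""
    seq = seq.upper()
    return bool(HOMOPOLYMER.search(seq) or TANDEM.search(seq))


def repeat_labels(seq: str) -> List[str]:
    """Return ["Risk-Repeat"] when a risky repeat is found, otherwise an empty list."""
    return ["Risk-Repeat"] if has_risky_repeat(seq) else []
```

For example, `repeat_labels("CAG" * 10)` reports the label because the unit CAG occurs ten times in a row, while nine copies do not trigger it.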
According to an embodiment of the present invention, step S3 may further include:
calculating the length of the promoter;
judging whether the length of the promoter is less than 100bp, if so, setting a fourth label for the gene sequence;
and judging whether the length of the promoter is more than 77000bp, and if so, setting a fifth label for the gene sequence.
In other words, in this step, the lengths of the promoter and the target gene can be calculated, and if the length of the promoter or the target gene is less than or equal to 100bp, the vector is labeled with Risk-Small Insert; if the length of the promoter or the target gene is more than or equal to 77000bp, labeling the vector with Risk-Large Insert; otherwise, no processing is performed.
In other embodiments of the present invention, step S3 includes: judging whether the gene sequence contains non-ATCG characters, and if so, setting a sixth label for the gene sequence. For example, this step judges whether the full sequence of the vector contains non-ATCG characters; if so, the vector is labeled Ambiguous Base; otherwise, no processing is performed.
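The length check and the non-ATCG check are simple enough to sketch together. The sketch below is illustrative only: the function names are hypothetical, the inclusive comparisons follow the "less than or equal to 100bp" and "more than or equal to 77000bp" wording of the example (the preceding steps use strict inequalities), and the label text "Ambiguous Base" is taken as written in the source.

```python
from typing import List

SMALL_INSERT_MAX_BP = 100    # at or below this length: Risk-Small Insert
LARGE_INSERT_MIN_BP = 77000  # at or above this length: Risk-Large Insert
VALID_BASES = frozenset("ATCG")


def length_labels(element_seq: str) -> List[str]:
    """Label a promoter or target gene sequence that is unusually short or long."""
    n = len(element_seq)
    if n <= SMALL_INSERT_MAX_BP:
        return ["Risk-Small Insert"]
    if n >= LARGE_INSERT_MIN_BP:
        return ["Risk-Large Insert"]
    return []


def ambiguous_positions(vector_seq: str) -> List[int]:
    """0-based positions of every character that is not A, T, C or G (case-insensitive)."""
    return [i for i, ch in enumerate(vector_seq.upper()) if ch not in VALID_BASES]


def ambiguous_labels(vector_seq: str) -> List[str]:
    """Label the full vector sequence if it contains any non-ATCG character."""
    return ["Ambiguous Base"] if ambiguous_positions(vector_seq) else []
```

Returning the offending positions as well as the label is a design choice of this sketch; it lets staff see where characters such as N or R occur, which the source does not require.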
According to an embodiment of the present invention, when the gene vector is a virus, the step S3 further includes:
judging the virus type of the gene vector to be diagnosed;
if the gene vector to be diagnosed is a lentivirus vector, judging whether the sequence fragment between the Δ5' LTR and ΔU3/3' LTR elements is larger than 9200 bp;
if the gene vector to be diagnosed is an adeno-associated virus vector, judging whether the sequence fragment between the 5' ITR and 3' ITR elements is larger than 4700 bp;
if the gene vector to be diagnosed is an adenovirus vector, judging whether the sequence fragment between the 5' ITR and 3' ITR elements is larger than 38700 bp;
if the gene vector to be diagnosed is a retrovirus MMLV vector, judging whether the sequence fragment between the 5' MoMuLV LTR and 3' MoMuLV LTR elements is larger than 8300 bp;
if the gene vector to be diagnosed is a retrovirus MSCV vector, judging whether the sequence fragment between the MSCV 5' LTR and MSCV 3' LTR elements is larger than 8300 bp;
if so, setting a seventh label for the gene vector to be diagnosed.
Specifically, the vector is labeled Risk-Over Virus Packing Size if the sequence fragment between the bounding elements is larger than 9200 bp for a lentiviral (LV) vector (between the Δ5' LTR and ΔU3/3' LTR elements), larger than 4700 bp for an adeno-associated virus (AAV) vector, larger than 38700 bp for an adenoviral (AV) vector, larger than 8300 bp for a retroviral MMLV vector, or larger than 8300 bp for a retroviral MSCV vector; otherwise, no processing is performed.
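The packaging-size check amounts to a lookup of one limit per virus type. The following sketch assumes that the positions of the two bounding elements are already known from the vector annotation and that the fragment "between" them excludes the elements themselves; the dictionary keys, function name, coordinate convention, and normalised label spelling are all hypothetical.

```python
from typing import List

# Maximum fragment length in bp between the bounding elements, per virus type.
PACKAGING_LIMITS_BP = {
    "lentivirus": 9200,   # between Δ5' LTR and ΔU3/3' LTR
    "aav": 4700,          # between 5' ITR and 3' ITR
    "adenovirus": 38700,  # between 5' ITR and 3' ITR
    "mmlv": 8300,         # between 5' MoMuLV LTR and 3' MoMuLV LTR
    "mscv": 8300,         # between MSCV 5' LTR and MSCV 3' LTR
}
OVERSIZE_LABEL = "Risk-Over Virus Packing Size"  # spelling normalised; an assumption


def packaging_labels(vector_type: str, left_element_end: int,
                     right_element_start: int) -> List[str]:
    """Label a viral vector whose packaged fragment exceeds the limit for its type.

    Coordinates are 0-based: left_element_end is the index just after the left
    element, right_element_start is the index where the right element begins.
    """
    limit = PACKAGING_LIMITS_BP.get(vector_type.lower())
    if limit is None:
        return []  # non-viral or unknown vector type: no packaging check
    fragment_length = right_element_start - left_element_end
    if fragment_length < 0:
        raise ValueError("right element must not start before the left element ends")
    return [OVERSIZE_LABEL] if fragment_length > limit else []
```

For instance, `packaging_labels("AAV", 145, 4846)` reports the label because the 4701 bp fragment exceeds the 4700 bp limit by one base.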
Thus, different training is carried out on each of the plurality of subsets. When a gene vector needs to be diagnosed, different training models can be applied to the gene vector to be diagnosed or the target gene as required. Because the sequence characteristics of the vector are judged in advance and the vector is marked with the corresponding characteristic labels, staff can make a preliminary judgment of the vector's price when the user requests a quote. This avoids discovering only after production has begun that the vector is difficult to make, too costly, or impossible to produce, improves efficiency, and reduces losses for both buyer and seller.
As shown in fig. 2, a gene vector sequence diagnosis system 100 according to an embodiment of the present invention includes a gene sequence acquisition module 10, a data processing module 20, and a label printing module 30.
Specifically, the gene sequence acquisition module 10 can acquire a gene sequence of a gene vector to be diagnosed and/or a target gene uploaded by a user, the data processing module 20 can receive the gene sequence acquired by the gene sequence acquisition module 10 and judge the characteristics of the gene sequence to obtain a judgment result, and the label printing module 30 prints a corresponding label on the gene sequence of which the label needs to be set according to the judgment result.
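The flow between the three modules can be pictured with the short sketch below. It is an assumed arrangement for illustration only: the class and function names are hypothetical, user input is assumed to be plain or FASTA-style text, and the single built-in check stands in for whichever feature checks the data processing module actually runs.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Check = Callable[[str], List[str]]  # takes a sequence, returns labels to attach


@dataclass
class DiagnosisResult:
    sequence: str
    labels: List[str] = field(default_factory=list)


def acquire_sequence(raw: str) -> str:
    """Module 10 (sketch): drop FASTA header lines and whitespace, then upper-case."""
    lines = [line.strip() for line in raw.splitlines()]
    body = "".join(line for line in lines if not line.startswith(">"))
    return "".join(body.split()).upper()


def process(sequence: str, checks: List[Check]) -> List[str]:
    """Module 20 (sketch): run every check and collect the distinct labels in order."""
    found: List[str] = []
    for check in checks:
        for label in check(sequence):
            if label not in found:
                found.append(label)
    return found


def attach_labels(sequence: str, labels: List[str]) -> DiagnosisResult:
    """Module 30 (sketch): bind the labels to the diagnosed sequence."""
    return DiagnosisResult(sequence, list(labels))


def diagnose(raw: str, checks: List[Check]) -> DiagnosisResult:
    """Chain the three modules: acquire, process, label."""
    sequence = acquire_sequence(raw)
    return attach_labels(sequence, process(sequence, checks))


def ambiguous_base_check(seq: str) -> List[str]:
    """Example check: flag any character other than A, T, C or G."""
    return ["Ambiguous Base"] if set(seq) - set("ATCG") else []
```

Passing the checks in as a list keeps the data processing step independent of any particular feature, so further checks can be added without changing the other two modules.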
The data processing module 20 can perform different data processing according to different characteristics of the gene sequence. In some embodiments of the present invention, the data processing module 20 can calculate the GC content of the promoter, determine whether the total GC content of the promoter is greater than 70% or the GC content of the local fragment is greater than 80%, and set a first tag for the gene sequence if yes; and judging whether the total GC content of the promoter is less than 30% or the GC content of the local fragment is less than 20%, and if so, setting a second label for the gene sequence.
Alternatively, the data processing module 20 can determine whether the promoter has a repetitive sequence of 10 or more repeats or a repetitive sequence of more than 20 consecutive single bases, and if so, set a third tag for the gene sequence.
Optionally, the data processing module 20 can calculate the length of the promoter, determine whether the length of the promoter is less than 100bp, and set a fourth tag for the gene sequence if the length of the promoter is less than 100 bp; and judging whether the length of the promoter is more than 77000bp, if so, setting a fifth label for the gene sequence.
Alternatively, data processing module 20 can determine whether the gene sequence contains a non-ATCG character, and if so, set a sixth tag for the gene sequence.
Alternatively, the data processing module 20 can determine the virus type of the gene vector to be diagnosed:
if the gene vector to be diagnosed is a lentivirus vector, judging whether the sequence fragment between the Δ5' LTR and ΔU3/3' LTR elements is larger than 9200 bp;
if the gene vector to be diagnosed is an adeno-associated virus vector, judging whether a sequence fragment between 5'ITR and 3' ITR elements is larger than 4700 bp;
if the gene vector to be diagnosed is an adenovirus vector, judging whether the sequence fragment between the 5'ITR and 3' ITR elements is larger than 38700 bp;
if the gene vector to be diagnosed is a retrovirus MMLV vector, judging whether a sequence fragment between the 5'MoMuLV LTR and the 3' MoMuLV LTR element is more than 8300 bp;
if the gene vector to be diagnosed is a retrovirus MSCV vector, judging whether the sequence fragment between the MSCV 5'LTR and the MSCV3' LTR elements is more than 8300 bp;
if yes, a seventh label is set for the gene vector to be diagnosed.
The specific diagnosis process of the gene vector sequence diagnosis system 100 according to the embodiment of the present invention has been described in detail in the above-mentioned embodiments, and thus will not be described in detail.
It should be noted that the gene vector sequence diagnosis system 100 according to the embodiment of the present invention may be deployed on a network. The gene sequence acquisition module 10 may be a data input window displayed online. After a user enters data in the data input window, the data processing module 20 in the background processes the data by diagnosing the gene sequence entered by the user, and the labeling result is output through the label printing module 30. When the background receives the data of an inquiry or order, it also obtains the labels of the corresponding vector and displays them on one page for staff to review. Staff can thus make a preliminary judgment of the vector's price when the user requests a quote, which avoids discovering only after production has begun that the vector is difficult to make, too costly, or impossible to produce, improves efficiency, and reduces losses for both buyer and seller.
In addition, the present invention provides a computer storage medium comprising one or more computer instructions that, when executed, implement any of the above-described methods for sequence diagnosis of a gene vector.
That is, the computer storage medium stores a computer program that, when executed by a processor, causes the processor to execute any one of the above-described methods for sequence diagnosis of a gene vector.
As shown in fig. 3, an embodiment of the present invention provides an electronic device 300, which includes a memory 310 and a processor 320, where the memory 310 is configured to store one or more computer instructions, and the processor 320 is configured to call and execute the one or more computer instructions, so as to implement any one of the methods described above.
That is, the electronic device 300 includes: a processor 320 and a memory 310, in which memory 310 computer program instructions are stored, wherein the computer program instructions, when executed by the processor, cause the processor 320 to perform any of the methods described above.
Further, as shown in fig. 3, the electronic device 300 further includes a network interface 330, an input device 340, a hard disk 350, and a display device 360.
The various interfaces and devices described above may be interconnected by a bus architecture. The bus architecture may include any number of interconnected buses and bridges, and couples together various circuits of one or more central processing units (CPUs), represented by the processor 320, and of one or more memories, represented by the memory 310. The bus architecture may also connect various other circuits such as peripherals, voltage regulators, and power management circuits. It will be appreciated that the bus architecture is used to enable communication among these components. In addition to a data bus, the bus architecture includes a power bus, a control bus, and a status signal bus, all of which are well known in the art and therefore are not described in detail herein.
The network interface 330 may be connected to a network (e.g., the internet, a local area network, etc.), and may obtain relevant data from the network and store the relevant data in the hard disk 350.
The input device 340 may receive various commands input by an operator and send the commands to the processor 320 for execution. The input device 340 may include a keyboard or a pointing device (e.g., a mouse, a trackball, a touch pad, a touch screen, or the like).
The display device 360 may display the result of the instructions executed by the processor 320.
The memory 310 stores the programs and data needed to run the operating system, as well as data such as intermediate results produced during computation by the processor 320.
It will be appreciated that memory 310 in embodiments of the invention may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), a Programmable Read Only Memory (PROM), an Erasable Programmable Read Only Memory (EPROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), or a flash memory. Volatile memory can be Random Access Memory (RAM), which acts as external cache memory. The memory 310 of the apparatus and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some embodiments, memory 310 stores the following elements, executable modules or data structures, or a subset thereof, or an expanded set thereof: an operating system 311 and application programs 312.
The operating system 311 includes various system programs, such as a framework layer, a core library layer, and a driver layer, which implement basic services and handle hardware-based tasks. The application programs 312 include various applications, such as a browser, which implement application services. A program implementing the methods of the embodiments of the present invention may be included in the application programs 312.
The methods disclosed in the above embodiments of the present invention may be applied to, or implemented by, the processor 320. The processor 320 may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above methods may be completed by integrated logic circuits of hardware in the processor 320 or by instructions in the form of software. The processor 320 may be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in connection with the embodiments of the present invention may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software modules may reside in storage media well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory 310; the processor 320 reads the information in the memory 310 and completes the steps of the above methods in combination with its hardware.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units designed to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
In particular, the processor 320 is also configured to read the computer program and execute any of the methods described above.
In the several embodiments provided in the present application, it should be understood that the disclosed methods and apparatus may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a logical division, and other divisions are possible in practice: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of another form.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
An integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, a server, or a network device) to execute some of the steps of the methods according to the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (16)

1. A method for diagnosing a sequence of a gene vector, comprising the steps of:
S1, obtaining gene sequences of a plurality of gene vectors to obtain a gene sequence set;
S2, randomly classifying the gene sequences of the gene sequence set according to the characteristics of the gene sequences to obtain a plurality of subsets;
S3, training on the gene sequences of each subset respectively to obtain a training model;
S4, acquiring a gene sequence of the gene vector to be diagnosed and/or the target gene, and inputting the gene sequence into the training model;
S5, setting, by the training model, corresponding labels for the gene vector to be diagnosed and/or the target gene according to the detection result.
2. The method of claim 1, wherein in step S2 the characteristics of the gene sequence include: the GC content of the gene sequence, the number of repeated sequences in the gene sequence, the length of the gene sequence, whether the gene sequence contains non-ATCG characters, and whether the gene vector is a viral vector; and wherein the number of the subsets is five.
3. The method according to claim 2, wherein step S3 includes:
calculating the GC content of the promoter;
judging whether the total GC content of the promoter is more than 70% or the GC content of a local fragment is more than 80%, if so, setting a first label for the gene sequence;
and judging whether the total GC content of the promoter is less than 30% or the GC content of the local fragment is less than 20%, and if so, setting a second label for the gene sequence.
4. The method of claim 3, wherein the local fragment is 180 bp to 230 bp in length.
5. The method according to claim 2, wherein step S3 includes:
judging whether the promoter has a repetitive sequence with more than 10 repeats or a run of more than 20 consecutive single bases, and if so, setting a third label for the gene sequence.
6. The method according to claim 2, wherein step S3 includes:
calculating the length of the promoter;
judging whether the length of the promoter is less than 100bp, if so, setting a fourth label for the gene sequence;
and judging whether the length of the promoter is more than 77000bp, and if so, setting a fifth label for the gene sequence.
7. The method according to claim 2, wherein step S3 includes:
and judging whether the gene sequence contains non-ATCG characters, if so, setting a sixth label for the gene sequence.
8. The method according to claim 2, wherein step S3 further comprises:
judging the virus type of the gene vector to be diagnosed;
if the gene vector to be diagnosed is a lentivirus vector, judging whether the sequence fragment between the delta 5' LTR and delta U3/3' LTR elements is larger than 9200 bp;
if the gene vector to be diagnosed is an adeno-associated virus vector, judging whether the sequence fragment between the 5' ITR and 3' ITR elements is larger than 4700 bp;
if the gene vector to be diagnosed is an adenovirus vector, judging whether the sequence fragment between the 5' ITR and 3' ITR elements is larger than 38700 bp;
if the gene vector to be diagnosed is a retrovirus MMLV vector, judging whether the sequence fragment between the 5' MoMuLV LTR and 3' MoMuLV LTR elements is larger than 8300 bp;
if the gene vector to be diagnosed is a retrovirus MSCV vector, judging whether the sequence fragment between the MSCV 5' LTR and MSCV 3' LTR elements is larger than 8300 bp;
if yes, setting a seventh label for the gene vector to be diagnosed.
9. A system for sequence diagnosis of a gene vector, comprising:
a gene sequence acquisition module, capable of acquiring a gene sequence of a gene vector to be diagnosed and/or a target gene uploaded by a user;
a data processing module, capable of receiving the gene sequence acquired by the gene sequence acquisition module and judging the characteristics of the gene sequence to obtain a judgment result;
and a label printing module, which applies a corresponding label to each gene sequence that requires one according to the judgment result.
10. The system of claim 9, wherein the data processing module is capable of calculating the GC content of the promoter, determining whether the total GC content of the promoter is greater than 70% or the GC content of the local fragment is greater than 80%, and if so, setting a first tag for the gene sequence; and judging whether the total GC content of the promoter is less than 30% or the GC content of the local fragment is less than 20%, and if so, setting a second label for the gene sequence.
11. The system for sequence diagnosis of a gene vector according to claim 9, wherein the data processing module is capable of determining whether the promoter has a repetitive sequence of 10 or more repeats or a repetitive sequence of more than 20 consecutive single bases, and if so, setting a third tag for the gene sequence.
12. The system of claim 9, wherein the data processing module is capable of calculating the length of a promoter, determining whether the length of the promoter is less than 100bp, and if so, setting a fourth tag for the gene sequence; and judging whether the length of the promoter is more than 77000bp, and if so, setting a fifth label for the gene sequence.
13. The system of claim 9, wherein the data processing module is capable of determining whether a gene sequence contains a non-ATCG character, and if so, setting a sixth tag for the gene sequence.
14. The system of claim 9, wherein the data processing module is capable of determining the virus type of the genetic vector to be diagnosed,
if the gene vector to be diagnosed is a lentivirus vector, judging whether the sequence fragment between the delta 5' LTR and delta U3/3' LTR elements is larger than 9200 bp;
if the gene vector to be diagnosed is an adeno-associated virus vector, judging whether the sequence fragment between the 5' ITR and 3' ITR elements is larger than 4700 bp;
if the gene vector to be diagnosed is an adenovirus vector, judging whether the sequence fragment between the 5' ITR and 3' ITR elements is larger than 38700 bp;
if the gene vector to be diagnosed is a retrovirus MMLV vector, judging whether the sequence fragment between the 5' MoMuLV LTR and 3' MoMuLV LTR elements is larger than 8300 bp;
if the gene vector to be diagnosed is a retrovirus MSCV vector, judging whether the sequence fragment between the MSCV 5' LTR and MSCV 3' LTR elements is larger than 8300 bp;
if yes, setting a seventh label for the gene vector to be diagnosed.
15. A computer storage medium comprising one or more computer instructions which, when executed, implement the method of any one of claims 1-8.
16. An electronic device comprising a memory and a processor, wherein,
the memory is to store one or more computer instructions;
the processor is configured to invoke and execute the one or more computer instructions to implement the method of any one of claims 1-8.
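
The threshold checks recited in claims 3 to 8 can be illustrated with a short Python sketch. This is only an illustration under stated assumptions; it is not the training model of claim 1 and not the applicant's implementation. The function names, the dictionary keys for the vector types, the fixed 200 bp sliding window (one value inside the claimed 180 bp to 230 bp range), the 2 to 6 bp repeat unit (the claims do not state a unit length), and the handling of sequences shorter than the window as a single fragment are all hypothetical choices.

```python
from __future__ import annotations

import re

# Assumption: one fixed window size chosen from the claimed 180-230 bp range.
LOCAL_WINDOW_BP = 200

# Maximum size (bp) of the fragment between the flanking LTR/ITR elements,
# taken from claim 8. The dictionary keys are hypothetical names.
CARGO_LIMIT_BP = {
    "lentivirus": 9200,
    "aav": 4700,
    "adenovirus": 38700,
    "mmlv": 8300,
    "mscv": 8300,
}


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def local_gc_extremes(seq: str, window: int = LOCAL_WINDOW_BP) -> tuple[float, float]:
    """Lowest and highest GC fraction over all full-length sliding windows."""
    if len(seq) < window:
        # Assumption: a short sequence is treated as one local fragment.
        overall = gc_fraction(seq)
        return overall, overall
    is_gc = [base in "GC" for base in seq]
    count = sum(is_gc[:window])
    lowest = highest = count
    for i in range(window, len(seq)):
        # Slide the window one base to the right.
        count += is_gc[i] - is_gc[i - window]
        lowest, highest = min(lowest, count), max(highest, count)
    return lowest / window, highest / window


def diagnose_sequence(promoter: str) -> set[int]:
    """Return the label numbers of claims 3-7 that apply to a promoter sequence."""
    seq = promoter.upper()
    labels = set()
    overall = gc_fraction(seq)
    local_low, local_high = local_gc_extremes(seq)
    if overall > 0.70 or local_high > 0.80:
        labels.add(1)  # claim 3: high GC content
    if overall < 0.30 or local_low < 0.20:
        labels.add(2)  # claim 3: low GC content
    # Claim 5: more than 10 tandem copies of a unit (assumed 2-6 bp long),
    # or a run of more than 20 identical bases.
    if re.search(r"([ACGT]{2,6})\1{10,}", seq) or re.search(r"([ACGT])\1{20,}", seq):
        labels.add(3)
    if len(seq) < 100:
        labels.add(4)  # claim 6: too short
    if len(seq) > 77000:
        labels.add(5)  # claim 6: too long
    if re.search(r"[^ATCG]", seq):
        labels.add(6)  # claim 7: non-ATCG character
    return labels


def exceeds_cargo_limit(vector_type: str, fragment_length_bp: int) -> bool:
    """True when the fragment is larger than the claim 8 limit (seventh label)."""
    return fragment_length_bp > CARGO_LIMIT_BP[vector_type]
```

Under these assumptions, `diagnose_sequence("G" * 150)` yields labels 1 and 3, and `exceeds_cargo_limit("aav", 4701)` returns `True` while a fragment of exactly 4700 bp does not trigger the seventh label.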
CN201911402569.XA 2019-12-30 2019-12-30 Method, system, storage medium, and electronic device for diagnosing sequence of gene vector Active CN111161800B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911402569.XA CN111161800B (en) 2019-12-30 2019-12-30 Method, system, storage medium, and electronic device for diagnosing sequence of gene vector

Publications (2)

Publication Number Publication Date
CN111161800A (en) 2020-05-15
CN111161800B (en) 2021-05-07

Family

ID=70559448

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911402569.XA Active CN111161800B (en) 2019-12-30 2019-12-30 Method, system, storage medium, and electronic device for diagnosing sequence of gene vector

Country Status (1)

Country Link
CN (1) CN111161800B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050026182A1 (en) * 2001-05-25 2005-02-03 Stephane Bejanin Human CDNAS and proteins and uses thereof
US20060122816A1 (en) * 2002-05-20 2006-06-08 Schadt Eric E Computer systems and methods for subdividing a complex disease into component diseases
KR20100001177A (en) * 2008-06-26 2010-01-06 주식회사 비츠로시스 Gene selection algorithm using principal component analysis
CN104462870A (en) * 2015-01-09 2015-03-25 苏州大学 Method and device for identifying human gene promoter
CN104834834A (en) * 2015-04-09 2015-08-12 苏州大学张家港工业技术研究院 Construction method and device of promoter recognition system
CN106295245A (en) * 2016-07-27 2017-01-04 广州麦仑信息科技有限公司 The method of storehouse noise reduction own coding gene information feature extraction based on Caffe
CN108753827A (en) * 2018-06-20 2018-11-06 上海生博生物医药科技有限公司 A kind of targeting excretion body carries out carrier and its construction method and application that gene carries medicine
CN109979536A (en) * 2019-03-07 2019-07-05 青岛市疾病预防控制中心(青岛市预防医学研究院) It is a kind of based on DNA bar code to the identification method of species
CN110070914A (en) * 2019-03-15 2019-07-30 崔大超 A kind of gene order recognition methods, system and computer readable storage medium
CN110468207A (en) * 2019-09-02 2019-11-19 北京师范大学 Based on the glioma EM/PM molecular typing methods of Taqman low density chip and its application
CN110534159A (en) * 2019-07-22 2019-12-03 中国人民解放军总医院 Construction method, device and the computer equipment of genopathy correlation analysis system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SEYEDE ZAHRA PAYLAKHI et al.: "A novel gene selection method using GA/SVM and Fisher criteria in Alzheimer's disease", 2015 23rd Iranian Conference on Electrical Engineering *
MA Baoshan et al.: "Identifying gene sequences using multiple statistical features", Computer Engineering and Applications *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112687338A (en) * 2020-12-31 2021-04-20 云舟生物科技(广州)有限公司 Method for storing and restoring gene sequence, computer storage medium and electronic device
CN113921083A (en) * 2021-10-27 2022-01-11 云舟生物科技(广州)有限公司 Custom sequence analysis method, computer storage medium and electronic device
CN113921083B (en) * 2021-10-27 2022-11-25 云舟生物科技(广州)股份有限公司 Custom sequence analysis method, computer storage medium and electronic device
CN115881227A (en) * 2022-12-28 2023-03-31 云舟生物科技(广州)股份有限公司 Carrier customization method and computer storage medium
CN115881227B (en) * 2022-12-28 2024-01-26 云舟生物科技(广州)股份有限公司 Carrier customization method and computer storage medium

Also Published As

Publication number Publication date
CN111161800B (en) 2021-05-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: Room d301-d309, Zone D, Guangzhou International Business Incubator, No. 3, Juquan Road, Science City, Guangzhou, Guangdong 510663

Patentee after: Yunzhou Biotechnology (Guangzhou) Co.,Ltd.

Address before: Room d301-d309, 3 / F, building D, Science City International Business Incubator, Huangpu District, Guangzhou, Guangdong Province 510663

Patentee before: YUNZHOU BIOSCIENCES (GUANGZHOU) Inc.
