CN111192629A - Construction method and application of gene sequence difficulty analysis model - Google Patents

Construction method and application of gene sequence difficulty analysis model

Info

Publication number
CN111192629A
Authority
CN
China
Prior art keywords
sequence
repeat
coverage area
subunit
regression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911337248.6A
Other languages
Chinese (zh)
Inventor
赵文妍
段广有
丁砚书
方其
张艳
葛毅
廖国娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Genewiz Suzhou Ltd
Original Assignee
Genewiz Suzhou Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-12-23
Filing date
2019-12-23
Publication date
2020-05-22
Application filed by Genewiz Suzhou Ltd
Priority to CN201911337248.6A
Publication of CN111192629A
Priority to PCT/CN2020/119562 (WO2021129035A1)
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the technical field of biology, and in particular to a construction method for a gene sequence difficulty analysis model and its application. The construction method comprises the following steps: taking a plurality of different gene sequences with known sequence difficulty as a modeling database; extracting sequence features from the gene sequences in the database; and establishing a quantitative prediction model by applying a regression algorithm to the extracted sequence features and the known sequence difficulty. The invention addresses the problem that, in production, the difficulty of a gene sequence to be synthesized cannot be predicted, which makes it hard to meet customers' requirements on the gene synthesis period and, when a large number of gene sequences are awaiting synthesis, prevents effective overall scheduling and lowers gene synthesis efficiency.

Description

Construction method and application of gene sequence difficulty analysis model
Technical Field
The invention relates to the technical field of biology, in particular to a construction method and application of a gene sequence difficulty analysis model.
Background
With the continuous development of technologies such as computing, bioinformatics and gene sequencing, the artificial synthesis of whole genes and even genomes has become possible. Gene synthesis refers to techniques for synthesizing a desired gene in vitro by biological methods; it can be used not only to modify existing genes but also to create genes that do not exist in nature, that is, 'modified life' and 'artificial life'. Because gene synthesis opens up a brand new direction for modifying organisms, any field connected with genes may require artificial synthesis. In the foreseeable future, gene synthesis will play a great role in the fields of life sciences, new energy, new materials, artificial life, nucleic acid vaccines, biomedicine, and the like.
At present, industrialized gene synthesis methods have been developed to perform gene synthesis rapidly and at high throughput, so as to meet the growing demand from research institutes and enterprises. The existing industrialized gene synthesis method has 7 modular steps: PCR amplification, ligation and transformation, monoclonal colony picking, bacterial culture PCR identification, plasmid extraction, Sanger sequencing, and PCR amplification of the correct clone, which finally yields PCR product fragments consistent with expectations. Because the method has many steps and low throughput, the whole process takes more than 72 hours and is costly. To improve gene synthesis efficiency, Chinese patent document CN107760672A discloses an industrialized gene synthesis method based on next-generation sequencing technology, which is fast, simple and efficient.
As demand for gene synthesis grows, gene synthesis companies may receive a large number of synthesis orders from different customers at the same time. The gene sequences to be synthesized are of many kinds and differ in difficulty, so the production period of a synthesis cannot be estimated; even when a standardized industrialized gene synthesis method is adopted, the company cannot give customers a production period. Moreover, because the period for each sequence to be synthesized is uncertain, production cannot be scheduled effectively as a whole, which lowers gene synthesis efficiency. To date, however, there have been no reports on analyzing the synthesis difficulty of different gene sequences.
Disclosure of Invention
Therefore, the technical problem to be solved by the present invention is to provide a method for constructing a gene sequence difficulty analysis model and an application thereof. The gene sequence difficulty analysis model constructed by the method can predict the difficulty of different gene sequences; based on the predicted difficulty, a relatively accurate gene synthesis period can be given to the customer for a sequence order, and the gene synthesis company can schedule production as a whole, which improves production efficiency.
In order to solve the technical problems, the invention provides the following technical scheme:
a method for constructing a gene sequence difficulty analysis model comprises the following steps:
taking a plurality of different gene sequences with known sequence difficulty as a modeling database;
carrying out sequence feature extraction on the gene sequences in the database;
and establishing a quantitative prediction model by using a regression algorithm according to the extracted sequence characteristics and the known sequence difficulty.
Further, the different genes with known sequence difficulties refer to different genes with known synthesis cycles.
Further, the extracted sequence features include: at least 3 of sequence length, sequence GC content, maximum forward repeat coverage area size, ratio of forward maximum repeat to repeat coverage area, ratio of total forward repeat coverage area to sequence length, maximum reverse repeat coverage area size, ratio of reverse maximum repeat to repeat coverage area, ratio of total reverse repeat coverage area to sequence length, number of consecutive repeat bases, and number of polymers.
Preferably, the sequence is characterized by a sequence length, a sequence GC content, a maximum forward repeat coverage area size, a forward maximum repeat to repeat coverage area ratio, a ratio of the sum of forward repeat coverage areas to the sequence length, a maximum reverse repeat coverage area size, a reverse maximum repeat to repeat coverage area ratio, a ratio of the sum of reverse repeat coverage areas to the sequence length, a number of consecutive repeat bases, and a number of polymers.
Preferably, the number of different gene sequences with known sequence difficulty is greater than or equal to 500.
Further, the regression algorithm includes a Bayesian ridge regression algorithm (BayesianRidge), a linear regression algorithm (LinearRegression), an elastic network (ElasticNet), a support vector regression (SVR), a background gradient boosting regression (GBR), a random forest regression (RandomForestRegressor), a gradient boosting regression (GradientBoostingRegressor), or an extreme random forest regression (ExtraTreesRegressor).
Further, the method also comprises: extracting the sequence features of a sequence to be tested, and then inputting the obtained sequence features into the quantitative prediction model.
Use of the quantitative prediction model constructed by the above construction method in predicting a gene synthesis cycle.
A method for predicting a gene synthesis cycle, comprising constructing a quantitative prediction model by the above construction method.
A gene synthesis difficulty analysis device includes:
the database unit is used for acquiring a plurality of different gene sequences with known sequence difficulty;
the sequence feature extraction unit is used for extracting the sequence features of the gene sequences in the database unit;
and the quantitative prediction model unit is used for establishing a quantitative prediction model by using a regression algorithm on the sequence characteristics and the known sequence difficulty.
Further, the different genes with known sequence difficulties refer to different genes with known synthesis cycles.
Further, the sequence feature extraction unit includes: at least 3 of a sequence length extraction subunit, a sequence GC content extraction subunit, a maximum forward repeat coverage area size extraction subunit, a forward maximum repeat and repeat coverage area ratio extraction subunit, a forward repeat coverage area sum and sequence length ratio extraction subunit, a maximum reverse repeat coverage area size extraction subunit, a reverse maximum repeat and repeat coverage area ratio extraction subunit, a reverse repeat coverage area sum and sequence length ratio extraction subunit, a continuous repeat base number extraction subunit and a polymer number extraction subunit.
Preferably, the sequence feature extraction unit includes: the sequence length extraction subunit, the sequence GC content extraction subunit, the maximum forward repeat coverage area size extraction subunit, the forward maximum repeat and repeat coverage area ratio extraction subunit, the forward repeat coverage area sum and sequence length ratio extraction subunit, the maximum reverse repeat coverage area size extraction subunit, the reverse maximum repeat and repeat coverage area ratio extraction subunit, the reverse repeat coverage area sum and sequence length ratio extraction subunit, the continuous repeat base number extraction subunit and the polymer number extraction subunit.
Further, the quantitative prediction model unit includes: a Bayesian ridge regression (BayesianRidge) subunit, a linear regression (LinearRegression) subunit, an elastic network (ElasticNet) subunit, a support vector regression (SVR) subunit, a background gradient boosting regression (GBR) subunit, a random forest regression (RandomForestRegressor) subunit, a gradient boosting regression (GradientBoostingRegressor) subunit or an extreme random forest regression (ExtraTreesRegressor) subunit.
Further, the device also comprises a detection unit for extracting the sequence features of a sequence to be tested and then inputting the obtained sequence features into the quantitative prediction model.
The technical scheme of the invention has the following advantages:
1. The invention provides a method for constructing a gene sequence difficulty analysis model, which comprises the following steps: taking a plurality of different gene sequences with known sequence difficulty as a modeling database; extracting sequence features from the gene sequences in the database; and establishing a quantitative prediction model by using a regression algorithm according to the extracted sequence features and the known sequence difficulty. This solves the problem that, in production, the difficulty of a gene sequence to be synthesized cannot be predicted, which makes it hard to meet customers' requirements on the gene synthesis period and, when a large number of gene sequences are awaiting synthesis, prevents effective overall scheduling and lowers gene synthesis efficiency.
2. In the method for constructing a gene sequence difficulty analysis model provided by the invention, the extracted sequence features include at least three of: sequence length, sequence GC content, maximum forward repeat coverage area size, ratio of forward maximum repeat to repeat coverage area, ratio of forward repeat coverage area sum to sequence length, maximum reverse repeat coverage area size, ratio of reverse maximum repeat to repeat coverage area, ratio of reverse repeat coverage area sum to sequence length, number of consecutive repeat bases and number of polymers. Long-term gene synthesis practice has shown that these sequence features are closely related to the difficulty of a gene sequence, and a model constructed by applying a regression algorithm to the selected sequence features and the known sequence difficulty can accurately estimate the difficulty of gene synthesis.
3. In the method for constructing a gene sequence difficulty analysis model provided by the invention, the regression algorithm includes Bayesian ridge regression (BayesianRidge), a linear regression algorithm (LinearRegression), an elastic network (ElasticNet), a support vector regression (SVR), a background gradient boosting regression (GBR), a random forest regression (RandomForestRegressor), a gradient boosting regression (GradientBoostingRegressor) or an extreme random forest regression (ExtraTreesRegressor). Research shows that a model constructed by applying these regression algorithms to the sequence features and known difficulty of gene sequences can accurately estimate the difficulty of gene synthesis, from which the gene synthesis period can be further predicted.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a diagram of sequence one, used in the calculation of the maximum forward repeat coverage area size in the present invention;
FIG. 2 is a diagram of sequence two, used in the calculation of the maximum forward repeat coverage area size in the present invention;
FIG. 3 is a diagram of sequence three, used in the calculation of the maximum forward repeat coverage area size in the present invention;
FIG. 4 is a diagram of sequence four, used in the calculation of the forward maximum repeat to repeat coverage area ratio in the present invention;
FIG. 5 is a diagram of sequence five, used in the calculation of the ratio of the sum of forward repeat coverage areas to the sequence length in the present invention;
FIG. 6 is a schematic structural diagram of the gene sequence difficulty analysis device in Example 3 of the present invention.
In FIGS. 1-5, thin solid lines indicate sequences; A, B and C denote forward repeat coverage areas; D and E denote forward repeats.
Detailed Description
In the present invention, the terms of the sequence features are explained as follows:
Sequence length refers to the total length of the sequence.
The GC content (GC%) of a sequence is the sum of the number of G bases and the number of C bases in the sequence, expressed as a percentage of the total number of bases in the sequence.
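The GC content definition can be illustrated with a short Python sketch. The function name `gc_content` and the choice to return 0.0 for an empty sequence are assumptions made for illustration; they do not come from the patent text.

```python
def gc_content(seq: str) -> float:
    """Return GC% = (number of G + number of C) / total number of bases * 100."""
    seq = seq.upper()
    if not seq:
        return 0.0  # assumption: an empty sequence is reported as 0%
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)
```

For example, a sequence of four bases containing one G and one C has a GC content of 50%.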
The maximum forward repeat coverage area size is the length of the largest region among the regions covered by forward repeats in the sequence (only forward repeats of 8 bp or longer are counted). If the interval between two forward repeat coverage areas is less than 20 bp, the two areas are considered together, and the sum of their lengths is taken as the candidate for the maximum forward repeat coverage area size.
In sequence one, shown in FIG. 1, taking forward repeats as an example, the sequence contains only the forward repeat coverage areas A, B and C. The interval between A and B is less than 20 bp, the interval between B and C is greater than 20 bp, and the sum of the lengths of A and B is greater than the length of C; the maximum forward repeat coverage area size is therefore the sum of the lengths of A and B.
In sequence two, shown in FIG. 2, taking forward repeats as an example, the sequence contains only the forward repeat coverage areas A, B and C. The interval between A and B is greater than 20 bp, the interval between B and C is greater than 20 bp, and the length of C > the length of A > the length of B; the maximum forward repeat coverage area size is therefore the length of C.
In sequence three, shown in FIG. 3, taking forward repeats as an example, the sequence contains only the forward repeat coverage areas A, B and C. The interval between A and B is greater than 20 bp, the interval between B and C is less than 20 bp, and the length of C > the length of A > the length of B; the maximum forward repeat coverage area size is therefore the sum of the lengths of C and B.
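The three cases illustrated in FIGS. 1-3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: coverage areas are assumed to be given as non-overlapping half-open `(start, end)` coordinates, the function name `max_repeat_coverage_size` is hypothetical, a chain of more than two areas separated by gaps under 20 bp is assumed to be summed together, and a gap of exactly 20 bp (which the text does not address) is treated as not merged.

```python
def max_repeat_coverage_size(regions, max_gap=20):
    """Largest repeat coverage size, summing neighbours closer than max_gap.

    regions: non-overlapping (start, end) half-open intervals, one per
             repeat coverage area (assumed input format).
    """
    if not regions:
        return 0
    regions = sorted(regions)
    prev_end = regions[0][1]
    current = regions[0][1] - regions[0][0]  # running sum of the current group
    best = current
    for start, end in regions[1:]:
        length = end - start
        if start - prev_end < max_gap:
            current += length   # interval < 20 bp: add to the current group
        else:
            current = length    # interval too large: start a new group
        best = max(best, current)
        prev_end = end
    return best
```

With hypothetical coordinates A = (0, 30), B = (40, 60), C = (100, 140), the A-B gap is 10 bp and the B-C gap is 40 bp, so the function compares len(A) + len(B) against len(C), as in FIG. 1.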
The forward maximum repeat to repeat coverage area ratio is defined as follows: a repeat coverage area is composed of several repeats, so a forward repeat coverage area is composed of several forward repeats; the ratio is the length of the largest forward repeat divided by the length of the repeat coverage area in which it is located.
In sequence four, shown in FIG. 4, taking forward repeats as an example, the sequence contains only the forward repeat coverage area A, which includes the forward repeat D and the forward repeat E. If the length of D > the length of E, the forward maximum repeat to repeat coverage area ratio equals the length of forward repeat D / the length of forward repeat coverage area A.
The ratio of the sum of forward repeat coverage areas to the sequence length is the sum of the lengths of all forward repeat coverage areas in the sequence divided by the sequence length. In sequence five, shown in FIG. 5, taking forward repeats as an example, the sequence contains only the forward repeat coverage areas A, B and C, and the ratio is (length of A + length of B + length of C) / the sequence length.
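The two ratios described for FIGS. 4 and 5 reduce to simple divisions. The sketch below assumes repeat lengths are given as plain integers and coverage areas as half-open `(start, end)` coordinates; both function names are hypothetical.

```python
def max_repeat_to_coverage_ratio(repeat_lengths, coverage_length):
    """Length of the largest repeat divided by the length of its coverage area."""
    return max(repeat_lengths) / coverage_length


def total_coverage_to_length_ratio(regions, seq_length):
    """Sum of all repeat coverage area lengths divided by the sequence length."""
    return sum(end - start for start, end in regions) / seq_length
```

The same functions would apply to reverse repeats by passing reverse repeat lengths and reverse repeat coverage areas instead.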
The maximum reverse repeat coverage area size is calculated in the same way as the maximum forward repeat coverage area size, except that the reverse repeat is calculated.
The reverse maximal repetition to repeat coverage ratio is calculated in the same way as the forward maximal repetition to repeat coverage ratio, except that the reverse repetition is calculated.
The ratio of the sum of reverse repeat coverage areas to the sequence length is calculated in the same way as the ratio of the sum of forward repeat coverage areas to the sequence length, except that reverse repeats are used.
The number of consecutive repeat bases is the number of consecutive repeats of any single base A, T, C or G in the sequence.
The number of polymers is the total number of poly structures, such as polyA, polyD, etc., present in the sequence.
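The last two features can be sketched in Python, with the caveat that the text gives no thresholds. The sketch assumes that "number of consecutive repeat bases" means the length of the longest run of a single base, and that a "poly structure" is a single-base run of at least `min_len` bases; the value `min_len=6` and all function names are hypothetical.

```python
from itertools import groupby


def single_base_runs(seq):
    """Return (base, run_length) for each maximal run of one repeated base."""
    return [(base, sum(1 for _ in group)) for base, group in groupby(seq.upper())]


def longest_single_base_run(seq):
    """Assumed reading of 'number of consecutive repeat bases'."""
    return max((n for _, n in single_base_runs(seq)), default=0)


def count_poly_tracts(seq, min_len=6):
    """Assumed reading of 'number of polymers': runs of at least min_len bases."""
    return sum(1 for _, n in single_base_runs(seq) if n >= min_len)
```

A different threshold or a different reading of either feature only requires changing `min_len` or the aggregation over `single_base_runs`.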
Example 1 construction method of Gene sequence difficulty analysis model
The embodiment provides a method for constructing a gene sequence difficulty analysis model, which comprises the following steps:
(1) taking 500 different gene sequences with known synthesis periods (the gene sequences are provided by Genewiz Suzhou Ltd) as a modeling database;
(2) extracting sequence features from the gene sequences in the database, wherein the extracted sequence features comprise sequence length, sequence GC content, maximum forward repeat coverage area size, forward maximum repeat to repeat coverage area ratio, ratio of the sum of forward repeat coverage areas to the sequence length, maximum reverse repeat coverage area size, reverse maximum repeat to repeat coverage area ratio, ratio of the sum of reverse repeat coverage areas to the sequence length, the number of consecutive repeat bases and the number of polymers; the sequence feature data of the gene sequences in the database are provided by Genewiz Suzhou Ltd;
(3) establishing a quantitative prediction model by applying a regression algorithm to the sequence feature data obtained in step (2) and the known sequence difficulty, wherein the regression algorithm is a Bayesian ridge regression algorithm (BayesianRidge), a linear regression algorithm (LinearRegression), an elastic network (ElasticNet), a support vector regression (SVR), a background gradient boosting regression (GBR), a random forest regression (RandomForestRegressor), a gradient boosting regression (GradientBoostingRegressor) or an extreme random forest regression (ExtraTreesRegressor); the regression algorithm selected in this embodiment is used to establish the quantitative prediction model. With R² as the evaluation index of the quantitative prediction model, the final fitting result is R² = 0.9, which indicates that the quantitative prediction model constructed in this example has excellent prediction performance.
(4) predicting a sequence to be tested: the sequence features of step (2) are extracted from the sequence to be tested and input into the quantitative prediction model constructed in step (3), and the synthesis difficulty of the sequence to be tested is obtained by calculation.
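Steps (3) and (4) can be sketched as follows. The patent does not state which of the listed algorithms was selected or how it was implemented, so this sketch uses ordinary least squares linear regression (one of the listed options) written in plain Python, together with the R² index. All function names are hypothetical, and the toy rows (length, GC%, known synthesis period) are invented for illustration and are not data from the patent.

```python
def fit_linear_regression(X, y):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    rows = [[1.0] + [float(v) for v in r] for r in X]  # prepend intercept term
    p = len(rows[0])
    # Augmented matrix [X^T X | X^T y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(A[k][col]))
        A[col], A[pivot] = A[pivot], A[col]
        if abs(A[col][col]) < 1e-12:
            raise ValueError("singular system: features are collinear")
        piv = A[col][col]
        A[col] = [v / piv for v in A[col]]
        for k in range(p):
            if k != col:
                f = A[k][col]
                A[k] = [a - f * b for a, b in zip(A[k], A[col])]
    return [A[i][p] for i in range(p)]  # [intercept, w1, ..., wn]


def predict(weights, features):
    """Step (4): apply the fitted model to one feature vector."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], features))


def r2_score(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, the evaluation index named in step (3)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data, invented for illustration: [length in bp, GC%] -> period in days
toy_X = [[1000, 50], [2000, 40], [1500, 60], [3000, 55], [2500, 45]]
toy_y = [17.0, 26.5, 23.0, 37.0, 31.5]
toy_w = fit_linear_regression(toy_X, toy_y)
toy_r2 = r2_score(toy_y, [predict(toy_w, row) for row in toy_X])
```

In practice the feature rows would hold the ten features of step (2), and any of the other listed regressors could be substituted for the least squares fit as long as it exposes the same fit and predict roles.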
Example 2 prediction of Gene Synthesis cycle
In this embodiment, the difficulty of the sequence to be detected is predicted by using the quantitative prediction model constructed in embodiment 1, and the period of gene synthesis of the sequence to be detected is obtained according to the predicted difficulty.
Example 3 Gene sequence difficulty analysis device
The present embodiment provides a gene sequence difficulty analysis device, where the structure diagram of the device is shown in fig. 6, and the device includes:
the database unit is used for acquiring a plurality of different gene sequences with known sequence difficulty;
the sequence feature extraction unit is used for extracting the sequence features of the gene sequences in the database unit;
further, the sequence feature extraction unit includes: a sequence length extraction subunit, a sequence GC content extraction subunit, a maximum forward repeat coverage area size extraction subunit, a forward maximum repeat and repeat coverage area proportion extraction subunit, a forward repeat coverage area sum and sequence length proportion extraction subunit, a maximum reverse repeat coverage area size extraction subunit, a reverse maximum repeat and repeat coverage area proportion extraction subunit, a reverse repeat coverage area sum and sequence length proportion extraction subunit, a continuous repeat base number extraction subunit and a polymer number extraction subunit;
the quantitative prediction model unit is used for establishing a quantitative prediction model by utilizing a regression algorithm according to the extracted sequence characteristics and the known sequence difficulty;
further, the quantitative prediction model unit includes: a Bayesian ridge regression (BayesianRidge) subunit, a linear regression (LinearRegression) subunit, an elastic network (ElasticNet) subunit, a support vector regression (SVR) subunit, a background gradient boosting regression (GBR) subunit, a random forest regression (RandomForestRegressor) subunit, a gradient boosting regression (GradientBoostingRegressor) subunit or an extreme random forest regression (ExtraTreesRegressor) subunit.
The device further comprises a detection unit for extracting the sequence features of a sequence to be tested and then inputting the obtained sequence features into the quantitative prediction model.
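One way to picture how the units of the device fit together is the Python sketch below. It is an illustrative arrangement only: the class name `DifficultyAnalysisDevice`, its method names and the callable-based interfaces are all assumptions, and `mean_baseline` is a placeholder model that is not one of the regression algorithms listed in the patent.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple

Predictor = Callable[[List[float]], float]
FeatureFn = Callable[[str], List[float]]
FitFn = Callable[[Sequence[List[float]], Sequence[float]], Predictor]


@dataclass
class DifficultyAnalysisDevice:
    extract_features: FeatureFn   # sequence feature extraction unit
    fit_model: FitFn              # quantitative prediction model unit
    database: List[Tuple[str, float]] = field(default_factory=list)  # database unit
    _predictor: Optional[Predictor] = None

    def add_record(self, sequence: str, difficulty: float) -> None:
        """Store a gene sequence together with its known difficulty."""
        self.database.append((sequence, difficulty))

    def build_model(self) -> None:
        """Extract features for every stored sequence and fit the model."""
        X = [self.extract_features(seq) for seq, _ in self.database]
        y = [difficulty for _, difficulty in self.database]
        self._predictor = self.fit_model(X, y)

    def detect(self, sequence: str) -> float:
        """Detection unit: extract features of a new sequence and predict."""
        if self._predictor is None:
            raise RuntimeError("call build_model() before detect()")
        return self._predictor(self.extract_features(sequence))


def mean_baseline(X, y):
    """Placeholder model: always predicts the mean known difficulty."""
    mean = sum(y) / len(y)
    return lambda features: mean
```

Passing the feature extractor and the fitting routine in as callables keeps the subunits interchangeable, which mirrors the way the device lists alternative regression subunits.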
It should be understood that the above examples are given only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here, and obvious variations or modifications derived therefrom remain within the scope of protection of the invention.

Claims (10)

1. A method for constructing a gene sequence difficulty analysis model is characterized by comprising the following steps:
taking a plurality of different gene sequences with known sequence difficulty as a modeling database;
carrying out sequence feature extraction on the gene sequences in the database;
and establishing a quantitative prediction model by using a regression algorithm according to the extracted sequence characteristics and the known sequence difficulty.
2. The construction method according to claim 1, wherein the extracted sequence features include: at least 3 of sequence length, sequence GC content, maximum forward repeat coverage area size, ratio of forward maximum repeat to repeat coverage area, ratio of total forward repeat coverage area to sequence length, maximum reverse repeat coverage area size, ratio of reverse maximum repeat to repeat coverage area, ratio of total reverse repeat coverage area to sequence length, number of consecutive repeat bases, and number of polymers.
3. The construction method according to claim 1 or 2, wherein the regression algorithm comprises a bayesian ridge regression algorithm, a linear regression algorithm, an elastic network, a support vector regression, a background gradient boost regression, a random forest regression, a gradient boost regression or an extreme random forest regression.
4. The construction method according to any one of claims 1 to 3, comprising: and extracting the sequence characteristics of the sequence to be detected, and then introducing the obtained sequence characteristics into the quantitative prediction model.
5. Use of a quantitative prediction model constructed by the construction method according to any one of claims 1 to 4 for predicting a gene synthesis cycle.
6. A method for predicting a gene synthesis cycle, comprising constructing a quantitative prediction model by the construction method according to any one of claims 1 to 4.
7. A gene sequence difficulty analysis device is characterized by comprising:
the database unit is used for acquiring a plurality of different gene sequences with known sequence difficulty;
the sequence feature extraction unit is used for extracting the sequence features of the gene sequences in the database unit;
and the quantitative prediction model unit is used for establishing a quantitative prediction model by using a regression algorithm on the sequence characteristics and the known sequence difficulty.
8. The apparatus of claim 7, wherein the sequence feature extraction unit comprises: at least 3 of a sequence length extraction subunit, a sequence GC content extraction subunit, a maximum forward repeat coverage area size extraction subunit, a forward maximum repeat and repeat coverage area ratio extraction subunit, a forward repeat coverage area sum and sequence length ratio extraction subunit, a maximum reverse repeat coverage area size extraction subunit, a reverse maximum repeat and repeat coverage area ratio extraction subunit, a reverse repeat coverage area sum and sequence length ratio extraction subunit, a continuous repeat base number extraction subunit and a polymer number extraction subunit.
9. The apparatus according to claim 7 or 8, wherein the quantitative prediction model unit comprises: the regression algorithm comprises a Bayesian ridge regression algorithm subunit, a linear regression algorithm subunit, an elastic network subunit, a support vector regression subunit, a background gradient boost regression subunit, a random forest regression subunit, a gradient boost regression subunit or an extreme random forest regression subunit.
10. The apparatus according to any one of claims 7 to 9, comprising a detection unit for extracting sequence features of a sequence to be tested and then introducing the obtained sequence features into the quantitative prediction model.
CN201911337248.6A 2019-12-23 2019-12-23 Construction method and application of gene sequence difficulty analysis model Pending CN111192629A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201911337248.6A CN111192629A (en) 2019-12-23 2019-12-23 Construction method and application of gene sequence difficulty analysis model
PCT/CN2020/119562 WO2021129035A1 (en) 2019-12-23 2020-09-30 Method for constructing model for gene sequence synthesis difficulty analysis and use thereof

Publications (1)

Publication Number Publication Date
CN111192629A (en) 2020-05-22

Family

ID=70707430

Country Status (2)

Country Link
CN (1) CN111192629A (en)
WO (1) WO2021129035A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021129035A1 (en) * 2019-12-23 2021-07-01 苏州金唯智生物科技有限公司 Method for constructing model for gene sequence synthesis difficulty analysis and use thereof
CN116705176A (en) * 2023-06-27 2023-09-05 苏州君跻基因科技有限公司 Analysis method, system and equipment for synthesis difficulty of gene synthesis sequence

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109887546A (en) * 2019-01-15 2019-06-14 明码(上海)生物科技有限公司 Single-gene or multi-gene copy number detection system and method based on second-generation sequencing technology

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599615B (en) * 2016-11-30 2019-04-05 广东顺德中山大学卡内基梅隆大学国际联合研究院 Sequence feature analysis method for predicting miRNA target genes
CN108133122B (en) * 2016-12-01 2020-09-15 深圳华大基因股份有限公司 Gene clustering method and metagenome assembly method and device based on same
CN108614955A (en) * 2018-05-04 2018-10-02 吉林大学 lncRNA identification method based on sequence composition, structural information and physicochemical characteristics
CN110517731A (en) * 2019-10-23 2019-11-29 上海思路迪医学检验所有限公司 Genetic test quality monitoring data processing method and system
CN111192629A (en) * 2019-12-23 2020-05-22 苏州金唯智生物科技有限公司 Construction method and application of gene sequence difficulty analysis model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
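Peng Chao et al.: "Research on accurate detection of mobile sequences in metagenomes" *
Li Bin: "Research on the LZ complexity algorithm and its application in biological sequence analysis" *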

Also Published As

Publication number Publication date
WO2021129035A1 (en) 2021-07-01

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200522