CN111192629A - Construction method and application of gene sequence difficulty analysis model - Google Patents
- Publication number
- CN111192629A
- Authority
- CN
- China
- Prior art keywords
- sequence
- repeat
- coverage area
- subunit
- regression
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
Landscapes
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Biophysics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Bioethics (AREA)
- Databases & Information Systems (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Genetics & Genomics (AREA)
- Molecular Biology (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention relates to the technical field of biology, in particular to a construction method of a gene sequence difficulty analysis model and an application thereof. The construction method comprises the following steps: taking a plurality of different gene sequences with known sequence difficulty as a modeling database; carrying out sequence feature extraction on the gene sequences in the database; and establishing a quantitative prediction model by applying a regression algorithm to the extracted sequence features and the known sequence difficulty. The model addresses the problem that, in production, the difficulty of a sequence to be synthesized cannot be predicted, so that customers' requirements on the gene synthesis period are hard to meet, a large number of gene sequences awaiting synthesis cannot be scheduled effectively, and gene synthesis efficiency is reduced.
Description
Technical Field
The invention relates to the technical field of biology, in particular to a construction method and application of a gene sequence difficulty analysis model.
Background
With the continuous development of computer technology, bioinformatics, gene sequencing and related technologies, the artificial synthesis of whole genes and even genomes has become possible. Gene synthesis refers to techniques for synthesizing a desired gene in vitro by biological methods; it can not only modify existing genes but also create genes that do not exist in nature, that is, 'modified life' and 'artificial life'. Because gene synthesis technology opens up a brand new direction for modifying organisms, every field connected with genes has a need for artificial gene synthesis. In the foreseeable future, gene synthesis will play a great role in the fields of life sciences, new energy, new materials, artificial life, nucleic acid vaccines, biomedicine, and the like.
At present, industrialized gene synthesis methods have been developed to perform gene synthesis rapidly and with high throughput, so as to meet the growing demand of research institutes and enterprises. The existing industrialized gene synthesis method has 7 modularized steps: PCR amplification, ligation and transformation, picking of monoclonal colonies, bacterial culture PCR identification, plasmid extraction, Sanger sequencing, and PCR amplification of the correct clone, which finally yields PCR product fragments consistent with expectations. Because the method has many steps and low throughput, the running time of the whole process exceeds 72 hours and the cost is high. To improve gene synthesis efficiency, Chinese patent document CN107760672A discloses an industrial gene synthesis method based on next-generation sequencing technology, which is fast, simple and efficient.
As the demand for gene synthesis grows, a gene synthesis company may receive a large number of synthesis orders from different customers at the same time. The gene sequences to be synthesized vary widely and differ in difficulty, so the production period of a synthesis order cannot be estimated. Even when a standardized industrial gene synthesis method is used, the company cannot give customers a production period; and because the period of each sequence is uncertain, the work cannot be scheduled effectively, which reduces the efficiency of gene synthesis. However, no analysis of the sequence difficulty of different gene sequences has been reported.
Disclosure of Invention
Therefore, the technical problem to be solved by the present invention is to provide a method for constructing a gene sequence difficulty analysis model and an application thereof. The gene sequence difficulty analysis model constructed by the method can predict the sequence difficulty of different gene sequences; based on the predicted difficulty, a relatively accurate gene synthesis period can be given to the customer for each sequence order, which also helps a gene synthesis company schedule its work and improves production efficiency.
In order to solve the technical problems, the invention provides the following technical scheme:
a method for constructing a gene sequence difficulty analysis model comprises the following steps:
taking a plurality of different gene sequences with known sequence difficulty as a modeling database;
carrying out sequence feature extraction on the gene sequences in the database;
and establishing a quantitative prediction model by using a regression algorithm according to the extracted sequence characteristics and the known sequence difficulty.
Further, the different genes with known sequence difficulties refer to different genes with known synthesis cycles.
Further, the extracted sequence features include: at least 3 of sequence length, sequence GC content, maximum forward repeat coverage area size, ratio of forward maximum repeat to repeat coverage area, ratio of total forward repeat coverage area to sequence length, maximum reverse repeat coverage area size, ratio of reverse maximum repeat to repeat coverage area, ratio of total reverse repeat coverage area to sequence length, number of consecutive repeat bases, and number of polymers.
Preferably, the sequence is characterized by a sequence length, a sequence GC content, a maximum forward repeat coverage area size, a forward maximum repeat to repeat coverage area ratio, a ratio of the sum of forward repeat coverage areas to the sequence length, a maximum reverse repeat coverage area size, a reverse maximum repeat to repeat coverage area ratio, a ratio of the sum of reverse repeat coverage areas to the sequence length, a number of consecutive repeat bases, and a number of polymers.
Preferably, the number of different gene sequences with known sequence difficulty is greater than or equal to 500.
Further, the regression algorithm includes a Bayesian ridge regression algorithm (BayesianRidge), a linear regression algorithm (LinearRegression), an elastic network (ElasticNet), a support vector regression (SVR), a background gradient boosting regression (GBR), a random forest regression (RandomForestRegressor), a gradient boosting regression (GradientBoostingRegressor), or an extreme random forest regression (ExtraTreesRegressor).
Further, the method comprises: extracting the sequence features of a sequence to be tested, and then inputting the obtained sequence features into the quantitative prediction model.
Use of the quantitative prediction model constructed by the above construction method in predicting a gene synthesis period.
A method for predicting a gene synthesis period comprises constructing a quantitative prediction model by the above construction method.
A gene synthesis difficulty analysis device includes:
the database unit is used for acquiring a plurality of different gene sequences with known sequence difficulty;
the sequence feature extraction unit is used for extracting the sequence features of the gene sequences in the database unit;
and the quantitative prediction model unit is used for establishing a quantitative prediction model by using a regression algorithm on the sequence characteristics and the known sequence difficulty.
Further, the different genes with known sequence difficulties refer to different genes with known synthesis cycles.
Further, the sequence feature extraction unit includes: at least 3 of a sequence length extraction subunit, a sequence GC content extraction subunit, a maximum forward repeat coverage area size extraction subunit, a forward maximum repeat and repeat coverage area ratio extraction subunit, a forward repeat coverage area sum and sequence length ratio extraction subunit, a maximum reverse repeat coverage area size extraction subunit, a reverse maximum repeat and repeat coverage area ratio extraction subunit, a reverse repeat coverage area sum and sequence length ratio extraction subunit, a continuous repeat base number extraction subunit and a polymer number extraction subunit.
Preferably, the sequence feature extraction unit includes: the sequence length extraction subunit, the sequence GC content extraction subunit, the maximum forward repeat coverage area size extraction subunit, the forward maximum repeat and repeat coverage area ratio extraction subunit, the forward repeat coverage area sum and sequence length ratio extraction subunit, the maximum reverse repeat coverage area size extraction subunit, the reverse maximum repeat and repeat coverage area ratio extraction subunit, the reverse repeat coverage area sum and sequence length ratio extraction subunit, the continuous repeat base number extraction subunit and the polymer number extraction subunit.
Further, the quantitative prediction model unit includes: a Bayesian ridge regression (BayesianRidge) subunit, a linear regression (LinearRegression) subunit, an elastic network (ElasticNet) subunit, a support vector regression (SVR) subunit, a background gradient boosting regression (GBR) subunit, a random forest regression (RandomForestRegressor) subunit, a gradient boosting regression (GradientBoostingRegressor) subunit, or an extreme random forest regression (ExtraTreesRegressor) subunit.
The device further comprises a detection unit for extracting the sequence features of a sequence to be tested and then inputting the obtained sequence features into the quantitative prediction model.
The technical scheme of the invention has the following advantages:
1. The invention provides a method for constructing a gene sequence difficulty analysis model, comprising the following steps: taking a plurality of different gene sequences with known sequence difficulty as a modeling database; carrying out sequence feature extraction on the gene sequences in the database; and establishing a quantitative prediction model by using a regression algorithm according to the extracted sequence features and the known sequence difficulty. This solves the problem that, in production, the difficulty of a sequence to be synthesized cannot be predicted, so that customers' requirements on the gene synthesis period are hard to meet, a large number of gene sequences awaiting synthesis cannot be scheduled effectively, and gene synthesis efficiency is reduced.
2. The invention provides a method for constructing a gene sequence difficulty analysis model, wherein the extracted sequence features include at least three of: sequence length, sequence GC content, maximum forward repeat coverage area size, ratio of forward maximum repeat to repeat coverage area, ratio of the sum of forward repeat coverage areas to sequence length, maximum reverse repeat coverage area size, ratio of reverse maximum repeat to repeat coverage area, ratio of the sum of reverse repeat coverage areas to sequence length, number of consecutive repeat bases, and number of polymers. Long-term gene synthesis practice shows that these sequence features are closely related to the difficulty of a gene sequence, so a model built by a regression algorithm from the selected sequence features and the known sequence difficulty can accurately estimate the difficulty of gene synthesis.
3. The invention provides a method for constructing a gene sequence difficulty analysis model, wherein the regression algorithm comprises a Bayesian ridge regression algorithm (BayesianRidge), a linear regression algorithm (LinearRegression), an elastic network (ElasticNet), a support vector regression (SVR), a background gradient boosting regression (GBR), a random forest regression (RandomForestRegressor), a gradient boosting regression (GradientBoostingRegressor), or an extreme random forest regression (ExtraTreesRegressor). Research shows that a model built by such a regression algorithm from the sequence features of gene sequences with known sequence difficulty, together with that known difficulty, can accurately estimate the difficulty of gene synthesis and can further be used to predict the gene synthesis period.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a diagram of sequence one, used in the calculation of the maximum forward repeat coverage area size in the present invention;
FIG. 2 is a diagram of sequence two, used in the calculation of the maximum forward repeat coverage area size in the present invention;
FIG. 3 is a diagram of sequence three, used in the calculation of the maximum forward repeat coverage area size in the present invention;
FIG. 4 is a diagram of sequence four, used in the calculation of the forward maximum repeat to repeat coverage area ratio in the present invention;
FIG. 5 is a diagram of sequence five, used in the calculation of the ratio of the sum of forward repeat coverage areas to the sequence length in the present invention;
FIG. 6 is a schematic diagram of the gene sequence difficulty analysis device in Example 3 of the present invention.
In FIGS. 1-5, thin solid lines indicate sequences; A, B and C represent forward repeat coverage areas; and D and E represent forward repeats.
Detailed Description
In the present invention, the terms of the sequence features are explained as follows:
sequence length refers to the total length of the sequence.
The GC content (GC%) of a sequence is the sum of the number of G bases and the number of C bases in the sequence, expressed as a percentage of the total number of bases in the sequence.
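The GC content definition can be illustrated with a short Python sketch. The function name `gc_content` and the choice to treat the input case-insensitively are assumptions made for illustration; they are not taken from the patent.

```python
def gc_content(sequence: str) -> float:
    """Return GC content as a percentage of all bases in the sequence."""
    seq = sequence.upper()  # assumption: lower-case input is accepted
    if not seq:
        raise ValueError("sequence must not be empty")
    gc_bases = seq.count("G") + seq.count("C")
    return 100.0 * gc_bases / len(seq)


# Usage: four bases, one G and one C
print(gc_content("ATGC"))
```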
The maximum forward repeat coverage area size is the length of the largest of the areas covered by forward repeats in the sequence (only forward repeats of 8 bp or more are considered). If the interval between two forward repeat coverage areas is less than 20 bp, the sum of the lengths of the two coverage areas is treated as a single value when determining the maximum forward repeat coverage area size.
In sequence one shown in FIG. 1, taking forward repeats as an example, the sequence contains only forward repeat coverage areas A, B and C. The interval between areas A and B is less than 20 bp, the interval between areas B and C is greater than 20 bp, and the sum of the lengths of areas A and B is greater than the length of area C; the maximum forward repeat coverage area size is therefore the sum of the lengths of areas A and B.
In sequence two shown in FIG. 2, taking forward repeats as an example, the sequence contains only forward repeat coverage areas A, B and C. The interval between areas A and B is greater than 20 bp, the interval between areas B and C is greater than 20 bp, and the length of area C > the length of area A > the length of area B; the maximum forward repeat coverage area size is therefore the length of area C.
In sequence three shown in FIG. 3, taking forward repeats as an example, the sequence contains only forward repeat coverage areas A, B and C. The interval between areas A and B is greater than 20 bp, the interval between areas B and C is less than 20 bp, and the length of area C > the length of area A > the length of area B; the maximum forward repeat coverage area size is therefore the sum of the lengths of areas B and C.
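The three cases illustrated by FIGS. 1-3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: the coverage areas are assumed to be already detected and supplied as non-overlapping `(start, end)` pairs (end exclusive); the function name, the coordinates in the usage lines, and the handling of a gap of exactly 20 bp (treated here as "not close") are all hypothetical, since the text only specifies the "< 20 bp" and "> 20 bp" cases.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # (start, end), 0-based, end exclusive


def max_repeat_coverage_size(regions: List[Region], max_gap: int = 20) -> int:
    """Largest coverage size after summing areas separated by < max_gap bp."""
    best = 0
    group_total = 0
    prev_end = None
    for start, end in sorted(regions):
        length = end - start
        if prev_end is not None and start - prev_end < max_gap:
            group_total += length  # close neighbour: add the lengths together
        else:
            group_total = length   # far apart: start a new group
        best = max(best, group_total)
        prev_end = end
    return best


# Illustrative coordinates only (not taken from the figures):
fig1_like = [(0, 30), (40, 60), (100, 140)]   # A-B gap 10 bp, B-C gap 40 bp
fig2_like = [(0, 30), (60, 80), (120, 170)]   # both gaps above 20 bp
fig3_like = [(0, 30), (60, 80), (90, 140)]    # B-C gap 10 bp
for regions in (fig1_like, fig2_like, fig3_like):
    print(max_repeat_coverage_size(regions))
```

The same function would apply to reverse repeat coverage areas, since the text states that the reverse features are calculated in the same way.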
The forward maximum repeat to repeat coverage area ratio is defined as follows: a repeat coverage area is composed of several repeats, and a forward repeat coverage area is composed of several forward repeats; the ratio is the length of the longest forward repeat divided by the length of the repeat coverage area in which it is located.
In sequence four shown in FIG. 4, taking forward repeats as an example, the sequence contains only forward repeat coverage area A, which includes forward repeat D and forward repeat E. If the length of forward repeat D > the length of forward repeat E, the forward maximum repeat to repeat coverage area ratio is the length of forward repeat D divided by the length of forward repeat coverage area A.
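As a worked illustration of this ratio, the following Python sketch divides the longest repeat length by the length of its coverage area. The function name and the example lengths are hypothetical; repeat detection itself is assumed to have been done elsewhere.

```python
from typing import Sequence


def max_repeat_to_coverage_ratio(repeat_lengths: Sequence[int],
                                 coverage_length: int) -> float:
    """Length of the longest repeat divided by its coverage area length."""
    if coverage_length <= 0:
        raise ValueError("coverage_length must be positive")
    if not repeat_lengths:
        return 0.0  # assumption: no repeats means a ratio of zero
    return max(repeat_lengths) / coverage_length


# Usage: repeats D (12 bp) and E (8 bp) inside a 40 bp coverage area A
print(max_repeat_to_coverage_ratio([12, 8], 40))
```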
The ratio of the sum of forward repeat coverage areas to the length of the sequence is the sum of the lengths of all forward repeat coverage areas in the sequence divided by the sequence length. In the fifth sequence shown in fig. 5, taking forward repeat as an example, the sequence includes only the forward repeat coverage areas A, B and C, and the ratio of the sum of the forward repeat coverage areas to the sequence length is (length of the forward repeat coverage area a + length of the forward repeat coverage area B + length of the forward repeat coverage area C)/the sequence length.
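The formula above can be written as a small Python helper. The function name and the example numbers are assumptions for illustration only.

```python
from typing import Sequence


def coverage_sum_to_length_ratio(area_lengths: Sequence[int],
                                 sequence_length: int) -> float:
    """Sum of all repeat coverage area lengths divided by the sequence length."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    return sum(area_lengths) / sequence_length


# Usage: areas A, B and C of 30, 20 and 50 bp in a 1000 bp sequence
print(coverage_sum_to_length_ratio([30, 20, 50], 1000))
```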
The maximum reverse repeat coverage area size is calculated in the same way as the maximum forward repeat coverage area size, except that the reverse repeat is calculated.
The reverse maximal repetition to repeat coverage ratio is calculated in the same way as the forward maximal repetition to repeat coverage ratio, except that the reverse repetition is calculated.
The ratio of the sum of reverse repeat coverage areas to the sequence length is calculated in the same way as the ratio of the sum of forward repeat coverage areas to the sequence length, except that reverse repeats are used.
The number of consecutive repeat bases is the number of times any one of the bases A, T, C or G is repeated consecutively in the sequence.
The number of polymers is the sum of the number of poly structures such as polyA, polyD, etc., present in the sequence.
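The two run-based features can be sketched in Python using `itertools.groupby`. The text does not give exact thresholds, so this sketch makes two labeled assumptions: the number of consecutive repeat bases is taken as the length of the longest single-base run, and a "poly structure" is counted as any single-base run of at least `min_run` bases, with `min_run=6` chosen arbitrarily. All function names are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple


def base_runs(sequence: str, min_run: int = 1) -> List[Tuple[str, int]]:
    """Return (base, run_length) for each run of identical bases >= min_run."""
    runs = []
    for base, group in groupby(sequence.upper()):
        run_length = sum(1 for _ in group)
        if run_length >= min_run:
            runs.append((base, run_length))
    return runs


def longest_base_run(sequence: str) -> int:
    """Assumed reading of 'number of consecutive repeat bases'."""
    return max((n for _, n in base_runs(sequence)), default=0)


def count_poly_tracts(sequence: str, min_run: int = 6) -> int:
    """Assumed reading of 'number of polymers'; min_run is an arbitrary choice."""
    return len(base_runs(sequence, min_run))


example = "ACG" + "A" * 7 + "TTT" + "G" * 6 + "C"
print(longest_base_run(example), count_poly_tracts(example))
```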
Example 1 construction method of Gene sequence difficulty analysis model
The embodiment provides a method for constructing a gene sequence difficulty analysis model, which comprises the following steps:
(1) taking 500 different gene sequences with known synthesis periods (the gene sequences are provided by Jinzhi Biotechnology Co., Ltd.) as a modeling database;
(2) extracting sequence features of the gene sequences in the database, wherein the extracted sequence features comprise sequence length, sequence GC content, maximum forward repeat coverage area size, forward maximum repeat to repeat coverage area ratio, ratio of the sum of forward repeat coverage areas to the sequence length, maximum reverse repeat coverage area size, reverse maximum repeat to repeat coverage area ratio, ratio of the sum of reverse repeat coverage areas to the sequence length, the number of consecutive repeat bases and the number of polymers; the sequence feature data of the gene sequences in the database are provided by Jinzhi Biotechnology Co., Ltd.;
(3) establishing a quantitative prediction model by applying a regression algorithm to the sequence feature data obtained in step (2) and the known sequence difficulty, wherein the regression algorithm is a Bayesian ridge regression algorithm (BayesianRidge), a linear regression algorithm (LinearRegression), an elastic network (ElasticNet), a support vector regression (SVR), a background gradient boosting regression (GBR), a random forest regression (RandomForestRegressor), a gradient boosting regression (GradientBoostingRegressor), or an extreme random forest regression (ExtraTreesRegressor); the regression algorithm selected in this embodiment is used to establish the quantitative prediction model. With R² as the evaluation index of the quantitative prediction model, the final fitting result is an R² of 0.9, which indicates that the quantitative prediction model constructed in this example has excellent prediction performance.
(4) predicting a sequence to be tested: extracting the sequence features listed in step (2) from the sequence to be tested, inputting them into the quantitative prediction model constructed in step (3), and calculating the synthesis difficulty of the sequence to be tested.
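Steps (1) to (4) can be sketched end to end in Python. This is a minimal illustration under loud assumptions: the data are synthetic and randomly generated (they are not the 500 sequences of this example), only three made-up features are used, and plain ordinary least squares, one of the listed algorithm families, is implemented by hand so that the sketch needs only the standard library. The resulting R² describes the synthetic data only and says nothing about the patent's reported result.

```python
import random


def fit_linear_regression(X, y):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    rows = [[1.0] + list(x) for x in X]  # prepend an intercept column
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(n):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]  # [intercept, w1, w2, ...]


def predict(weights, x):
    return weights[0] + sum(w * v for w, v in zip(weights[1:], x))


def r2_score(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Steps (1)-(2): synthetic stand-in for the modeling database and its features
random.seed(0)
X, y = [], []
for _ in range(500):
    length_kb = random.uniform(0.3, 5.0)
    gc_fraction = random.uniform(0.3, 0.7)
    repeat_ratio = random.uniform(0.0, 0.3)
    # Hypothetical relationship between features and synthesis period (days)
    cycle_days = (5 + 2 * length_kb + 4 * gc_fraction + 8 * repeat_ratio
                  + random.gauss(0, 0.3))
    X.append((length_kb, gc_fraction, repeat_ratio))
    y.append(cycle_days)

# Step (3): fit on 400 records, evaluate R^2 on the remaining 100
weights = fit_linear_regression(X[:400], y[:400])
held_out_r2 = r2_score(y[400:], [predict(weights, x) for x in X[400:]])
print(f"held-out R^2 on synthetic data: {held_out_r2:.3f}")

# Step (4): predict a new, hypothetical sequence from its features
print(predict(weights, (2.0, 0.55, 0.1)))
```

In practice one would more likely use a library estimator such as those named in step (3); the hand-written solver is used here only to keep the sketch dependency-free.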
Example 2 prediction of Gene Synthesis cycle
In this embodiment, the difficulty of a sequence to be tested is predicted by using the quantitative prediction model constructed in Example 1, and the gene synthesis period of the sequence is obtained from the predicted difficulty.
Example 3 Gene sequence difficulty analysis device
The present embodiment provides a gene sequence difficulty analysis device, where the structure diagram of the device is shown in fig. 6, and the device includes:
the database unit is used for acquiring a plurality of different gene sequences with known sequence difficulty;
the sequence feature extraction unit is used for extracting the sequence features of the gene sequences in the database unit;
further, the sequence feature extraction unit includes: a sequence length extraction subunit, a sequence GC content extraction subunit, a maximum forward repeat coverage area size extraction subunit, a forward maximum repeat and repeat coverage area proportion extraction subunit, a forward repeat coverage area sum and sequence length proportion extraction subunit, a maximum reverse repeat coverage area size extraction subunit, a reverse maximum repeat and repeat coverage area proportion extraction subunit, a reverse repeat coverage area sum and sequence length proportion extraction subunit, a continuous repeat base number extraction subunit and a polymer number extraction subunit;
the quantitative prediction model unit is used for establishing a quantitative prediction model by utilizing a regression algorithm according to the extracted sequence characteristics and the known sequence difficulty;
further, the quantitative prediction model unit includes: a Bayesian ridge regression (BayesianRidge) subunit, a linear regression (LinearRegression) subunit, an elastic network (ElasticNet) subunit, a support vector regression (SVR) subunit, a background gradient boosting regression (GBR) subunit, a random forest regression (RandomForestRegressor) subunit, a gradient boosting regression (GradientBoostingRegressor) subunit, or an extreme random forest regression (ExtraTreesRegressor) subunit.
The device further comprises a detection unit for extracting the sequence features of a sequence to be tested and then inputting the obtained sequence features into the quantitative prediction model.
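The unit structure of the device can be sketched as a set of small Python classes. This is only one possible software reading of the device: all class names, method names and the two toy feature subunits are hypothetical, and `MeanRegressor` is a deliberately trivial placeholder standing in for whichever regression subunit is chosen (any object with `fit` and `predict` methods could be substituted).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class DatabaseUnit:
    """Holds gene sequences together with their known difficulty."""
    records: List[Tuple[str, float]] = field(default_factory=list)

    def add(self, sequence: str, difficulty: float) -> None:
        self.records.append((sequence, difficulty))


@dataclass
class FeatureExtractionUnit:
    """Each subunit maps a sequence to one numeric feature."""
    subunits: Dict[str, Callable[[str], float]]

    def extract(self, sequence: str) -> List[float]:
        return [subunit(sequence) for subunit in self.subunits.values()]


class MeanRegressor:
    """Placeholder regressor: always predicts the mean training target."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


@dataclass
class PredictionModelUnit:
    regressor: object  # anything exposing fit(X, y) and predict(X)

    def build(self, database: DatabaseUnit, extractor: FeatureExtractionUnit) -> None:
        X = [extractor.extract(seq) for seq, _ in database.records]
        y = [difficulty for _, difficulty in database.records]
        self.regressor.fit(X, y)


@dataclass
class DetectionUnit:
    extractor: FeatureExtractionUnit
    model: PredictionModelUnit

    def predict(self, sequence: str) -> float:
        return self.model.regressor.predict([self.extractor.extract(sequence)])[0]


# Usage with toy data and two toy feature subunits
database = DatabaseUnit()
database.add("ATGC", 3.0)
database.add("GGCC", 5.0)
extractor = FeatureExtractionUnit({
    "length": lambda s: float(len(s)),
    "gc_percent": lambda s: 100.0 * (s.count("G") + s.count("C")) / len(s),
})
model = PredictionModelUnit(MeanRegressor())
model.build(database, extractor)
detector = DetectionUnit(extractor, model)
print(detector.predict("ATAT"))
```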
It should be understood that the above examples are given only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here, and obvious variations or modifications derived therefrom remain within the scope of the invention.
Claims (10)
1. A method for constructing a gene sequence difficulty analysis model is characterized by comprising the following steps:
taking a plurality of different gene sequences with known sequence difficulty as a modeling database;
carrying out sequence feature extraction on the gene sequences in the database;
and establishing a quantitative prediction model by using a regression algorithm according to the extracted sequence characteristics and the known sequence difficulty.
2. The construction method according to claim 1, wherein the extracted sequence features include: at least 3 of sequence length, sequence GC content, maximum forward repeat coverage area size, ratio of forward maximum repeat to repeat coverage area, ratio of total forward repeat coverage area to sequence length, maximum reverse repeat coverage area size, ratio of reverse maximum repeat to repeat coverage area, ratio of total reverse repeat coverage area to sequence length, number of consecutive repeat bases, and number of polymers.
3. The construction method according to claim 1 or 2, wherein the regression algorithm comprises a bayesian ridge regression algorithm, a linear regression algorithm, an elastic network, a support vector regression, a background gradient boost regression, a random forest regression, a gradient boost regression or an extreme random forest regression.
4. The construction method according to any one of claims 1 to 3, comprising: and extracting the sequence characteristics of the sequence to be detected, and then introducing the obtained sequence characteristics into the quantitative prediction model.
5. Use of a quantitative prediction model constructed by the construction method according to any one of claims 1 to 4 for predicting a gene synthesis cycle.
6. A method for predicting a gene synthesis cycle, comprising constructing a quantitative prediction model by the construction method according to any one of claims 1 to 4.
7. A gene sequence difficulty analysis device is characterized by comprising:
the database unit is used for acquiring a plurality of different gene sequences with known sequence difficulty;
the sequence feature extraction unit is used for extracting the sequence features of the gene sequences in the database unit;
and the quantitative prediction model unit is used for establishing a quantitative prediction model by using a regression algorithm on the sequence characteristics and the known sequence difficulty.
8. The device according to claim 7, wherein the sequence feature extraction unit comprises at least 3 of: a sequence length extraction subunit, a sequence GC content extraction subunit, a maximum forward repeat coverage area size extraction subunit, a forward maximum repeat and repeat coverage area ratio extraction subunit, a forward repeat coverage area sum and sequence length ratio extraction subunit, a maximum reverse repeat coverage area size extraction subunit, a reverse maximum repeat and repeat coverage area ratio extraction subunit, a reverse repeat coverage area sum and sequence length ratio extraction subunit, a continuous repeat base number extraction subunit, and a polymer number extraction subunit.
9. The device according to claim 7 or 8, wherein the quantitative prediction model unit comprises a Bayesian ridge regression algorithm subunit, a linear regression algorithm subunit, an elastic net subunit, a support vector regression subunit, a background gradient boost regression subunit, a random forest regression subunit, a gradient boosting regression subunit, or an extremely randomized trees (extreme random forest) regression subunit.
10. The device according to any one of claims 7 to 9, further comprising a detection unit for extracting the sequence features of a sequence to be tested and then inputting the obtained sequence features into the quantitative prediction model.
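The workflow described in the claims above (extract sequence features, fit a regression model to known difficulty values, then score a new sequence) can be illustrated with a minimal Python sketch. It is not the applicant's implementation. It assumes three of the listed features: sequence length, GC content, and the longest run of a single repeated base, which is one possible reading of "continuous repeat base number". It uses ordinary linear regression, one of the algorithms named in claim 3, solved through the normal equations. All function names are hypothetical, and the training labels are synthetic values produced by a made-up rule (`toy_difficulty`), not measured synthesis difficulty.

```python
from itertools import groupby


def extract_features(seq):
    """Return [length, GC content, longest single-base run] for a DNA sequence."""
    if not seq:
        raise ValueError("sequence must not be empty")
    seq = seq.upper()
    length = len(seq)
    gc_content = (seq.count("G") + seq.count("C")) / length
    # groupby splits the sequence into runs of identical consecutive bases.
    longest_run = max(len(list(group)) for _, group in groupby(seq))
    return [float(length), gc_content, float(longest_run)]


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def fit_linear_regression(features, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    rows = [[1.0] + f for f in features]  # leading 1.0 is the intercept term
    n = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(n)]
    return solve_linear_system(xtx, xty)


def predict(weights, seq):
    """Score a new sequence with the fitted weights."""
    feats = [1.0] + extract_features(seq)
    return sum(w * f for w, f in zip(weights, feats))


def toy_difficulty(seq):
    """Hypothetical label rule, used only to make synthetic training data."""
    length, gc_content, longest_run = extract_features(seq)
    return 1.0 + 0.1 * length + 2.0 * gc_content + 0.5 * longest_run


# Stand-in for the "database unit": sequences paired with synthetic labels.
training_sequences = [
    "ATGCATGC", "AAAATTTTGG", "GGGGGGCCAT", "ATATATATATGC", "GCGCGGGCAAAT",
]
features = [extract_features(s) for s in training_sequences]
targets = [toy_difficulty(s) for s in training_sequences]
weights = fit_linear_regression(features, targets)

# Stand-in for the "detection unit": score a sequence that was not in training.
print(predict(weights, "ATGGGGGGCA"))
```

In practice the labels would come from recorded synthesis outcomes rather than a formula, and any of the other regressors listed in claim 3 could replace the least-squares fit without changing the feature-extraction step.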
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911337248.6A CN111192629A (en) | 2019-12-23 | 2019-12-23 | Construction method and application of gene sequence difficulty analysis model |
PCT/CN2020/119562 WO2021129035A1 (en) | 2019-12-23 | 2020-09-30 | Method for constructing model for gene sequence synthesis difficulty analysis and use thereof |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911337248.6A CN111192629A (en) | 2019-12-23 | 2019-12-23 | Construction method and application of gene sequence difficulty analysis model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111192629A (en) | 2020-05-22 |
Family
ID=70707430
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911337248.6A Pending CN111192629A (en) | 2019-12-23 | 2019-12-23 | Construction method and application of gene sequence difficulty analysis model |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111192629A (en) |
WO (1) | WO2021129035A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021129035A1 (en) * | 2019-12-23 | 2021-07-01 | 苏州金唯智生物科技有限公司 | Method for constructing model for gene sequence synthesis difficulty analysis and use thereof |
CN116705176A (en) * | 2023-06-27 | 2023-09-05 | 苏州君跻基因科技有限公司 | Analysis method, system and equipment for synthesis difficulty of gene synthesis sequence |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109887546A (en) * | 2019-01-15 | 2019-06-14 | 明码(上海)生物科技有限公司 | A kind of single-gene or polygenes copy number detection system and method based on two generation sequencing technologies |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106599615B (en) * | 2016-11-30 | 2019-04-05 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | A kind of sequence signature analysis method for predicting miRNA target gene |
CN108133122B (en) * | 2016-12-01 | 2020-09-15 | 深圳华大基因股份有限公司 | Gene clustering method and metagenome assembly method and device based on same |
CN108614955A (en) * | 2018-05-04 | 2018-10-02 | 吉林大学 | One kind is formed based on sequence, the lncRNA identification methods of structural information and physicochemical characteristics |
CN110517731A (en) * | 2019-10-23 | 2019-11-29 | 上海思路迪医学检验所有限公司 | Genetic test quality monitoring data processing method and system |
CN111192629A (en) * | 2019-12-23 | 2020-05-22 | 苏州金唯智生物科技有限公司 | Construction method and application of gene sequence difficulty analysis model |
- 2019-12-23: CN application CN201911337248.6A, published as CN111192629A; status: active, pending
- 2020-09-30: WO application PCT/CN2020/119562, published as WO2021129035A1; status: active, application filing
Non-Patent Citations (2)
Title |
---|
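Peng Chao et al.: "Research on the accurate detection of mobile sequences in metagenomes" (宏基因组中可移动序列的精确检测问题研究) *
Li Bin: "Research on the LZ complexity algorithm and its application in biological sequence analysis" (LZ复杂性算法及其在生物序列分析中的应用研究) *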
Also Published As
Publication number | Publication date |
---|---|
WO2021129035A1 (en) | 2021-07-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200522 |