CN111192629A - Construction method and application of gene sequence difficulty analysis model

Info

Publication number: CN111192629A
Application number: CN201911337248.6A
Authority: CN
Inventors: 赵文妍; 段广有; 丁砚书; 方其; 张艳; 葛毅; 廖国娟
Original assignee: Genewiz Suzhou Ltd
Current assignee: Genewiz Suzhou Ltd
Priority date: 2019-12-23
Filing date: 2019-12-23
Publication date: 2020-05-22
Also published as: WO2021129035A1

Abstract

The invention relates to the technical field of biology, in particular to a construction method of a gene sequence difficulty analysis model and application thereof. The construction method comprises the following steps: taking a plurality of different gene sequences with known sequence difficulty as a modeling database; carrying out sequence feature extraction on the gene sequences in the database; and establishing a quantitative prediction model by applying a regression algorithm to the extracted sequence features and the known sequence difficulty. The invention addresses the problem that, in production, the difficulty of a gene sequence to be synthesized cannot be predicted, which makes it hard to meet customers' requirements on the gene synthesis period and, when a large number of gene sequences are waiting to be synthesized, prevents effective overall scheduling and reduces gene synthesis efficiency.

Description

Construction method and application of gene sequence difficulty analysis model

Technical Field

The invention relates to the technical field of biology, in particular to a construction method and application of a gene sequence difficulty analysis model.

Background

With the continuous development of technologies such as computing, bioinformatics and gene sequencing, the artificial synthesis of whole genes and even genomes has become possible. Gene synthesis refers to a technique for synthesizing a desired gene in vitro by biological methods; it can not only modify existing genes but also create genes that do not exist in nature, namely 'modified life' and 'artificial life'. Because gene synthesis technology opens up a brand-new direction for modifying organisms, artificial synthesis is needed in any field connected with genes. In the foreseeable future, gene synthesis will play a great role in the fields of life sciences, new energy, new materials, artificial life, nucleic acid vaccines, biomedicine, and the like.

At present, in order to perform gene synthesis rapidly and with high throughput, industrialized gene synthesis methods have been developed to meet the increasing requirements of research institutes and enterprises. The existing industrialized gene synthesis method has 7 modular steps: PCR amplification, ligation and transformation, monoclonal colony picking, bacterial liquid PCR identification, plasmid extraction, Sanger sequencing, and PCR amplification of the correct clone, finally yielding PCR product fragments consistent with expectations. Because the method has many steps and low throughput, the whole process takes more than 72 hours and the cost is high. In order to improve gene synthesis efficiency, Chinese patent document CN107760672A discloses an industrial gene synthesis method based on next-generation sequencing technology, which is fast, simple and efficient.

With the increasing demand for gene synthesis, gene synthesis companies may receive a large number of gene sequence synthesis orders from different customers at the same time. The gene sequences to be synthesized differ in type and in difficulty, so the production period of gene sequence synthesis cannot be estimated; even if a standardized industrial gene synthesis method is adopted, a production period cannot be provided to customers. Meanwhile, because the period of the gene sequences to be synthesized is uncertain, effective overall scheduling cannot be carried out, and the efficiency of gene synthesis is reduced. However, there are no reports on analyzing the difficulty of different gene sequences.

Disclosure of Invention

Therefore, the technical problem to be solved by the present invention is to provide a method for constructing a gene sequence difficulty analysis model and an application thereof. The gene sequence difficulty analysis model constructed by the method can predict the difficulty of different gene sequences; according to the predicted difficulty, a relatively accurate gene synthesis period can be provided to the customer for each sequence order, which also facilitates the overall scheduling of a gene synthesis company and improves production efficiency.

In order to solve the technical problems, the invention provides the following technical scheme:

a method for constructing a gene sequence difficulty analysis model comprises the following steps:

taking a plurality of different gene sequences with known sequence difficulty as a modeling database;

carrying out sequence feature extraction on the gene sequences in the database;

and establishing a quantitative prediction model by using a regression algorithm according to the extracted sequence characteristics and the known sequence difficulty.

Further, the different genes with known sequence difficulties refer to different genes with known synthesis cycles.

Further, the extracted sequence features include: at least 3 of sequence length, sequence GC content, maximum forward repeat coverage area size, ratio of forward maximum repeat to repeat coverage area, ratio of total forward repeat coverage area to sequence length, maximum reverse repeat coverage area size, ratio of reverse maximum repeat to repeat coverage area, ratio of total reverse repeat coverage area to sequence length, number of consecutive repeat bases, and number of polymers.

Preferably, the extracted sequence features are: sequence length, sequence GC content, maximum forward repeat coverage area size, ratio of forward maximum repeat to repeat coverage area, ratio of the sum of forward repeat coverage areas to the sequence length, maximum reverse repeat coverage area size, ratio of reverse maximum repeat to repeat coverage area, ratio of the sum of reverse repeat coverage areas to the sequence length, number of consecutive repeat bases, and number of polymers.

Preferably, the number of different gene sequences with known sequence difficulty is greater than or equal to 500.

Further, the regression algorithm includes a Bayesian ridge regression algorithm (BayesianRidge), a linear regression algorithm (LinearRegression), an elastic network (ElasticNet), support vector regression (SVR), background gradient boosting regression (GBR), random forest regression (RandomForestRegressor), gradient boosting regression (GradientBoostingRegressor), or extreme random forest regression (ExtraTreesRegressor).

Further, the method comprises the following step: extracting the sequence features of the sequence to be tested, and then feeding the obtained sequence features into the quantitative prediction model.

Use of the quantitative prediction model constructed by the above construction method in predicting a gene synthesis period.

A gene synthesis period prediction method comprises using the quantitative prediction model obtained by the above construction method.

A gene synthesis difficulty analysis device includes:

the database unit is used for acquiring a plurality of different gene sequences with known sequence difficulty;

the sequence feature extraction unit is used for extracting the sequence features of the gene sequences in the database unit;

and the quantitative prediction model unit is used for establishing a quantitative prediction model by using a regression algorithm on the sequence characteristics and the known sequence difficulty.

Further, the sequence feature extraction unit includes: at least 3 of a sequence length extraction subunit, a sequence GC content extraction subunit, a maximum forward repeat coverage area size extraction subunit, a forward maximum repeat and repeat coverage area ratio extraction subunit, a forward repeat coverage area sum and sequence length ratio extraction subunit, a maximum reverse repeat coverage area size extraction subunit, a reverse maximum repeat and repeat coverage area ratio extraction subunit, a reverse repeat coverage area sum and sequence length ratio extraction subunit, a continuous repeat base number extraction subunit and a polymer number extraction subunit.

Preferably, the sequence feature extraction unit includes: the sequence length extraction subunit, the sequence GC content extraction subunit, the maximum forward repeat coverage area size extraction subunit, the forward maximum repeat and repeat coverage area ratio extraction subunit, the forward repeat coverage area sum and sequence length ratio extraction subunit, the maximum reverse repeat coverage area size extraction subunit, the reverse maximum repeat and repeat coverage area ratio extraction subunit, the reverse repeat coverage area sum and sequence length ratio extraction subunit, the continuous repeat base number extraction subunit and the polymer number extraction subunit.

Further, the quantitative prediction model unit includes: a Bayesian ridge regression (BayesianRidge) subunit, a linear regression (LinearRegression) subunit, an elastic network (ElasticNet) subunit, a support vector regression (SVR) subunit, a background gradient boosting regression (GBR) subunit, a random forest regression (RandomForestRegressor) subunit, a gradient boosting regression (GradientBoostingRegressor) subunit or an extreme random forest regression (ExtraTreesRegressor) subunit.

The device further comprises a detection unit, which is used for extracting the sequence features of the sequence to be tested and then feeding the obtained sequence features into the quantitative prediction model.

The technical scheme of the invention has the following advantages:

1. The invention provides a method for constructing a gene sequence difficulty analysis model, which comprises the following steps: taking a plurality of different gene sequences with known sequence difficulty as a modeling database; carrying out sequence feature extraction on the gene sequences in the database; and establishing a quantitative prediction model by using a regression algorithm according to the extracted sequence features and the known sequence difficulty. The method addresses the problem that, in production, the difficulty of a gene sequence to be synthesized cannot be predicted, which makes it hard to meet customers' requirements on the gene synthesis period and, when a large number of gene sequences are waiting to be synthesized, prevents effective overall scheduling and reduces gene synthesis efficiency.

2. The invention provides a method for constructing a gene synthesis difficulty analysis model, wherein the extracted sequence features include at least three of: sequence length, sequence GC content, maximum forward repeat coverage area size, ratio of forward maximum repeat to repeat coverage area, ratio of the sum of forward repeat coverage areas to the sequence length, maximum reverse repeat coverage area size, ratio of reverse maximum repeat to repeat coverage area, ratio of the sum of reverse repeat coverage areas to the sequence length, number of consecutive repeat bases, and number of polymers. Long-term gene synthesis practice shows that these sequence features are closely related to the difficulty of a gene sequence, and a model constructed by a regression algorithm from the selected sequence features and the known sequence difficulty can accurately estimate the difficulty of gene synthesis.

3. The invention provides a method for constructing a gene sequence difficulty analysis model, wherein the regression algorithm includes a Bayesian ridge regression algorithm (BayesianRidge), a linear regression algorithm (LinearRegression), an elastic network (ElasticNet), support vector regression (SVR), background gradient boosting regression (GBR), random forest regression (RandomForestRegressor), gradient boosting regression (GradientBoostingRegressor) or extreme random forest regression (ExtraTreesRegressor). Research shows that a model constructed by such a regression algorithm from the sequence features of gene sequences with known sequence difficulty, together with that known difficulty, can accurately estimate the difficulty of gene synthesis and can further predict the gene synthesis period.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

FIG. 1 is a schematic diagram of sequence one in the calculation of the maximum forward repeat coverage area size in the present invention;

FIG. 2 is a schematic diagram of sequence two in the calculation of the maximum forward repeat coverage area size in the present invention;

FIG. 3 is a schematic diagram of sequence three in the calculation of the maximum forward repeat coverage area size in the present invention;

FIG. 4 is a schematic diagram of sequence four in the calculation of the ratio of forward maximum repeat to repeat coverage area in the present invention;

FIG. 5 is a schematic diagram of sequence five in the calculation of the ratio of the sum of forward repeat coverage areas to the sequence length in the present invention;

FIG. 6 is a schematic structural diagram of the gene sequence difficulty analysis device in Example 3 of the present invention.

In FIGS. 1-5, thin solid lines indicate sequences, A, B and C represent forward repeat coverage areas, and D and E represent forward repeats.

Detailed Description

In the present invention, the terms of the sequence features are explained as follows:

sequence length refers to the total length of the sequence.

The GC content (GC%) of a sequence is the sum of the number of G bases and the number of C bases in the sequence, expressed as a percentage of the total number of bases in the sequence.
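The definition above can be written as a short function. This is a minimal Python sketch; the function name `gc_content` and the choice to treat the input case-insensitively are assumptions made for illustration and are not taken from the invention.

```python
def gc_content(sequence: str) -> float:
    """Return the GC content of a sequence as a percentage of all bases."""
    seq = sequence.upper()  # assumption: lower-case input is accepted
    if not seq:
        raise ValueError("sequence must not be empty")
    gc_bases = seq.count("G") + seq.count("C")
    return 100.0 * gc_bases / len(seq)
```

For instance, a sequence in which half of the bases are G or C has a GC content of 50%.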

The maximum forward repeat coverage area size is the length of the largest area among the areas of the sequence covered by forward repeats (forward repeats of 8 bp or longer). If the interval between two forward repeat coverage areas is less than 20 bp, the sum of the lengths of the two coverage areas is treated as the size of a single area when determining the maximum forward repeat coverage area size.
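The text does not spell out how a forward repeat coverage area is located. One possible reading, sketched below in Python, marks every position covered by a subsequence of 8 bp that occurs two or more times on the same strand and reports each contiguous run of marked positions as one coverage area. The function name `forward_repeat_coverage`, the fixed-length k-mer scan and the half-open `(start, end)` interval convention are assumptions made for illustration, not the procedure used in the invention.

```python
from collections import defaultdict
from typing import List, Tuple

MIN_REPEAT = 8  # minimum forward repeat length stated in the text (bp)


def forward_repeat_coverage(sequence: str, k: int = MIN_REPEAT) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals covered by k-mers seen at least twice."""
    seq = sequence.upper()

    # Record every start position of every k-mer on the forward strand.
    starts_by_kmer = defaultdict(list)
    for i in range(len(seq) - k + 1):
        starts_by_kmer[seq[i:i + k]].append(i)

    # Mark the positions covered by k-mers that occur more than once.
    covered = [False] * len(seq)
    for starts in starts_by_kmer.values():
        if len(starts) > 1:
            for s in starts:
                for j in range(s, s + k):
                    covered[j] = True

    # Collapse runs of marked positions into intervals.
    intervals = []
    run_start = None
    for i, flag in enumerate(covered):
        if flag and run_start is None:
            run_start = i
        elif not flag and run_start is not None:
            intervals.append((run_start, i))
            run_start = None
    if run_start is not None:
        intervals.append((run_start, len(seq)))
    return intervals
```

Because overlapping k-mers mark adjacent positions, a repeat longer than 8 bp shows up as a single longer interval rather than as several 8 bp pieces.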

In the first sequence shown in FIG. 1, taking forward repeats as an example, the sequence only includes forward repeat coverage areas A, B and C. The interval between forward repeat coverage area A and forward repeat coverage area B is < 20 bp, the interval between forward repeat coverage area B and forward repeat coverage area C is > 20 bp, and the sum of the lengths of forward repeat coverage area A and forward repeat coverage area B is greater than the length of forward repeat coverage area C; the maximum forward repeat coverage area size is then the sum of the lengths of forward repeat coverage area A and forward repeat coverage area B.

In the second sequence shown in FIG. 2, taking forward repeats as an example, the sequence only includes forward repeat coverage areas A, B and C. The interval between forward repeat coverage area A and forward repeat coverage area B is > 20 bp, the interval between forward repeat coverage area B and forward repeat coverage area C is > 20 bp, and the length of forward repeat coverage area C > the length of forward repeat coverage area A > the length of forward repeat coverage area B; the maximum forward repeat coverage area size is then the length of forward repeat coverage area C.

In the third sequence shown in FIG. 3, taking forward repeats as an example, the sequence only includes forward repeat coverage areas A, B and C. The interval between forward repeat coverage area A and forward repeat coverage area B is > 20 bp, the interval between forward repeat coverage area B and forward repeat coverage area C is < 20 bp, and the length of forward repeat coverage area C > the length of forward repeat coverage area A > the length of forward repeat coverage area B; the maximum forward repeat coverage area size is then the sum of the length of forward repeat coverage area C and the length of forward repeat coverage area B.
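The three cases in FIG. 1 to FIG. 3 can be sketched in Python as follows. The function name `max_repeat_coverage_size` and the half-open `(start, end)` interval representation are assumptions made for illustration. Two points are not settled by the text and are handled here by assumption: a run of more than two areas, each closer than 20 bp to the next, is pooled into one sum, and an interval of exactly 20 bp is treated as not pooled.

```python
from typing import Iterable, Tuple

MAX_GAP = 20  # areas separated by fewer than 20 bp are pooled, per the text


def max_repeat_coverage_size(areas: Iterable[Tuple[int, int]], max_gap: int = MAX_GAP) -> int:
    """Return the largest pooled length among repeat coverage areas.

    `areas` holds half-open (start, end) intervals. Neighbouring areas whose
    interval is below `max_gap` contribute the sum of their lengths; the gap
    itself is not counted.
    """
    best = 0
    pooled = 0
    previous_end = None
    for start, end in sorted(areas):
        length = end - start
        if previous_end is not None and start - previous_end < max_gap:
            pooled += length   # close to the previous area: add to the pool
        else:
            pooled = length    # far from the previous area: start a new pool
        best = max(best, pooled)
        previous_end = end
    return best
```

With areas A, B and C placed as in FIG. 1, the function returns the combined length of A and B; placed as in FIG. 2, it returns the length of C; placed as in FIG. 3, it returns the combined length of B and C.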

The ratio of forward maximum repeat to repeat coverage area is defined as follows: a repeat coverage area is composed of a number of repeats, and a forward repeat coverage area is composed of a number of forward repeats; the ratio is the length of the largest forward repeat divided by the length of the repeat coverage area in which it is located.

In the fourth sequence shown in FIG. 4, taking forward repeats as an example, the sequence only includes forward repeat coverage area A, which contains forward repeat D and forward repeat E. If the length of forward repeat D > the length of forward repeat E, the ratio of forward maximum repeat to repeat coverage area equals the length of forward repeat D divided by the length of forward repeat coverage area A.
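The FIG. 4 calculation reduces to a single division. The Python sketch below assumes the lengths of the repeats inside one coverage area are already known; the function name `max_repeat_to_area_ratio` and the lengths used in the usage lines are hypothetical values chosen only to illustrate the arithmetic.

```python
from typing import Sequence


def max_repeat_to_area_ratio(repeat_lengths: Sequence[int], area_length: int) -> float:
    """Length of the largest repeat divided by the length of its coverage area."""
    if area_length <= 0:
        raise ValueError("area_length must be positive")
    if not repeat_lengths:
        return 0.0  # assumption: an area without repeats contributes a ratio of 0
    return max(repeat_lengths) / area_length


# Hypothetical lengths: repeat D = 12 bp, repeat E = 9 bp, coverage area A = 30 bp.
ratio = max_repeat_to_area_ratio([12, 9], 30)
```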

The ratio of the sum of forward repeat coverage areas to the sequence length is the sum of the lengths of all forward repeat coverage areas in the sequence divided by the sequence length. In the fifth sequence shown in FIG. 5, taking forward repeats as an example, the sequence only includes forward repeat coverage areas A, B and C, and the ratio of the sum of forward repeat coverage areas to the sequence length is (length of forward repeat coverage area A + length of forward repeat coverage area B + length of forward repeat coverage area C) / the sequence length.
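The FIG. 5 formula can be sketched in Python in the same way. The function name `coverage_sum_ratio` and the half-open `(start, end)` interval representation are assumptions made for illustration.

```python
from typing import Iterable, Tuple


def coverage_sum_ratio(areas: Iterable[Tuple[int, int]], sequence_length: int) -> float:
    """Sum of the lengths of all repeat coverage areas divided by the sequence length."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    total_covered = sum(end - start for start, end in areas)
    return total_covered / sequence_length
```

Unlike the maximum coverage area size, this feature does not depend on how far apart the areas are; every area contributes its full length.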

The maximum reverse repeat coverage area size is calculated in the same way as the maximum forward repeat coverage area size, except that the reverse repeat is calculated.

The reverse maximal repetition to repeat coverage ratio is calculated in the same way as the forward maximal repetition to repeat coverage ratio, except that the reverse repetition is calculated.

The ratio of the sum of reverse repeat coverage areas to the sequence length is calculated in the same way as the ratio of the sum of forward repeat coverage areas to the sequence length, except that reverse repeats are used in the calculation.

The number of consecutive repeat bases is the number of consecutive repeats of any one of the bases A, T, C or G in the sequence.
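The wording of this feature leaves room for interpretation. The Python sketch below assumes one reading, namely the length of the longest uninterrupted run of a single base; the function name `longest_base_run` is hypothetical, and other readings (for example, counting how many such runs occur) would need a different implementation.

```python
from itertools import groupby


def longest_base_run(sequence: str) -> int:
    """Return the length of the longest run of one repeated base (0 if empty)."""
    seq = sequence.upper()
    # groupby yields one group per maximal run of identical characters.
    return max((sum(1 for _ in run) for _, run in groupby(seq)), default=0)
```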

The number of polymers is the sum of the number of poly structures such as polyA, polyD, etc., present in the sequence.

Example 1 construction method of Gene sequence difficulty analysis model

The embodiment provides a method for constructing a gene sequence difficulty analysis model, which comprises the following steps:

(1) taking 500 different gene sequences with known synthesis periods (the gene sequences are provided by Genewiz Suzhou Ltd) as a modeling database;

(2) extracting sequence characteristics of the gene sequences in the database, wherein the extracted sequence characteristics comprise sequence length, sequence GC content, maximum forward repeat coverage area size, forward maximum repeat to repeat coverage area ratio, ratio of the sum of forward repeat coverage areas to the sequence length, maximum reverse repeat coverage area size, reverse maximum repeat to repeat coverage area ratio, ratio of the sum of reverse repeat coverage areas to the sequence length, the number of continuous repeat bases and the number of polymers; the sequence feature data of the gene sequences in the database is provided by Genewiz Suzhou Ltd;

(3) establishing a quantitative prediction model by using a regression algorithm according to the sequence feature data obtained in step (2) and the known sequence difficulty, wherein the regression algorithm includes a Bayesian ridge regression algorithm (BayesianRidge), a linear regression algorithm (LinearRegression), an elastic network (ElasticNet), support vector regression (SVR), background gradient boosting regression (GBR), random forest regression (RandomForestRegressor), gradient boosting regression (GradientBoostingRegressor) or extreme random forest regression (ExtraTreesRegressor), and the regression algorithm selected in this embodiment is used to establish the quantitative prediction model. With R² as the evaluation index of the quantitative prediction model, the final fitting result is an R² of 0.9, which indicates that the quantitative prediction model constructed in this example has excellent prediction performance.

(4) predicting the sequence to be tested: extracting the sequence features listed in step (2) from the sequence to be tested, feeding them into the quantitative prediction model constructed in step (3), and calculating the synthesis difficulty of the sequence to be tested.
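Steps (3) and (4) can be illustrated with a small, self-contained Python sketch. It uses ordinary least squares, i.e. the plain linear regression option from the list above, solved through the normal equations, and it evaluates the fit with R². Everything in it is an assumption made for illustration: the function names, the use of only three features (sequence length, GC content, maximum forward repeat coverage area size) and the six rows of invented toy data. It is not the model, the data or the algorithm actually selected in this embodiment.

```python
from typing import List, Sequence


def fit_linear_regression(X: Sequence[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Ordinary least squares via the normal equations (X^T X) w = X^T y.

    Returns the weights with the intercept as the first element.
    """
    rows = [[1.0] + [float(v) for v in row] for row in X]  # prepend intercept column
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("features are linearly dependent")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(weights: Sequence[float], features: Sequence[float]) -> float:
    """Predicted difficulty (here: synthesis period) for one feature vector."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], features))


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Invented toy data: [sequence length, GC%, max forward repeat coverage size].
X_train = [
    [500, 40, 0],
    [1000, 55, 10],
    [1500, 45, 30],
    [2000, 60, 0],
    [2500, 50, 20],
    [3000, 65, 40],
]
# Invented synthesis periods (days) for the rows above.
y_train = [4.0, 6.75, 9.25, 8.0, 10.5, 14.25]

weights = fit_linear_regression(X_train, y_train)         # step (3): build the model
fitted = [predict(weights, row) for row in X_train]
score = r_squared(y_train, fitted)                        # evaluation index R²
new_period = predict(weights, [1200, 50, 5])              # step (4): sequence to be tested
```

In practice the features would be the ten listed in step (2), the training set would contain at least the 500 sequences mentioned in step (1), and R² would be better judged on held-out sequences than on the training data itself.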

Example 2 prediction of Gene Synthesis cycle

In this embodiment, the difficulty of the sequence to be detected is predicted by using the quantitative prediction model constructed in embodiment 1, and the period of gene synthesis of the sequence to be detected is obtained according to the predicted difficulty.

Example 3 Gene sequence difficulty analysis device

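The present embodiment provides a gene sequence difficulty analysis device, the structure of which is shown in FIG. 6. The device includes:

the database unit, which is used for acquiring a plurality of different gene sequences with known sequence difficulty;

the sequence feature extraction unit, which is used for extracting the sequence features of the gene sequences in the database unit;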

further, the sequence feature extraction unit includes: a sequence length extraction subunit, a sequence GC content extraction subunit, a maximum forward repeat coverage area size extraction subunit, a forward maximum repeat and repeat coverage area proportion extraction subunit, a forward repeat coverage area sum and sequence length proportion extraction subunit, a maximum reverse repeat coverage area size extraction subunit, a reverse maximum repeat and repeat coverage area proportion extraction subunit, a reverse repeat coverage area sum and sequence length proportion extraction subunit, a continuous repeat base number extraction subunit and a polymer number extraction subunit;

the quantitative prediction model unit is used for establishing a quantitative prediction model by utilizing a regression algorithm according to the extracted sequence characteristics and the known sequence difficulty;

further, the quantitative prediction model unit includes: a Bayesian ridge regression (BayesianRidge) subunit, a linear regression (LinearRegression) subunit, an elastic network (ElasticNet) subunit, a support vector regression (SVR) subunit, a background gradient boosting regression (GBR) subunit, a random forest regression (RandomForestRegressor) subunit, a gradient boosting regression (GradientBoostingRegressor) subunit or an extreme random forest regression (ExtraTreesRegressor) subunit.

The device further comprises a detection unit, which is used for extracting the sequence features of the sequence to be tested and then feeding the obtained sequence features into the quantitative prediction model.

It should be understood that the above examples are given only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Obvious variations or modifications derived therefrom remain within the scope of the invention.

Claims

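1. A method for constructing a gene sequence difficulty analysis model is characterized by comprising the following steps:

taking a plurality of different gene sequences with known sequence difficulty as a modeling database;

carrying out sequence feature extraction on the gene sequences in the database;

and establishing a quantitative prediction model by using a regression algorithm according to the extracted sequence characteristics and the known sequence difficulty.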

2. The construction method according to claim 1, wherein the extracted sequence features include: at least 3 of sequence length, sequence GC content, maximum forward repeat coverage area size, ratio of forward maximum repeat to repeat coverage area, ratio of total forward repeat coverage area to sequence length, maximum reverse repeat coverage area size, ratio of reverse maximum repeat to repeat coverage area, ratio of total reverse repeat coverage area to sequence length, number of consecutive repeat bases, and number of polymers.

3. The construction method according to claim 1 or 2, wherein the regression algorithm comprises a Bayesian ridge regression algorithm, a linear regression algorithm, an elastic network, support vector regression, background gradient boosting regression, random forest regression, gradient boosting regression or extreme random forest regression.

4. The construction method according to any one of claims 1 to 3, further comprising: extracting the sequence features of the sequence to be tested, and then feeding the obtained sequence features into the quantitative prediction model.

5. Use of a quantitative prediction model constructed by the construction method according to any one of claims 1 to 4 for predicting a gene synthesis cycle.

6. A method for predicting a gene synthesis cycle, comprising constructing a quantitative prediction model by the construction method according to any one of claims 1 to 4.

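7. A gene sequence difficulty analysis device is characterized by comprising:

the database unit, which is used for acquiring a plurality of different gene sequences with known sequence difficulty;

the sequence feature extraction unit, which is used for extracting the sequence features of the gene sequences in the database unit;

and the quantitative prediction model unit, which is used for establishing a quantitative prediction model by using a regression algorithm on the sequence characteristics and the known sequence difficulty.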

8. The apparatus of claim 7, wherein the sequence feature extraction unit comprises: at least 3 of a sequence length extraction subunit, a sequence GC content extraction subunit, a maximum forward repeat coverage area size extraction subunit, a forward maximum repeat and repeat coverage area ratio extraction subunit, a forward repeat coverage area sum and sequence length ratio extraction subunit, a maximum reverse repeat coverage area size extraction subunit, a reverse maximum repeat and repeat coverage area ratio extraction subunit, a reverse repeat coverage area sum and sequence length ratio extraction subunit, a continuous repeat base number extraction subunit and a polymer number extraction subunit.

9. The apparatus according to claim 7 or 8, wherein the quantitative prediction model unit comprises: a Bayesian ridge regression algorithm subunit, a linear regression algorithm subunit, an elastic network subunit, a support vector regression subunit, a background gradient boosting regression subunit, a random forest regression subunit, a gradient boosting regression subunit or an extreme random forest regression subunit.

10. The apparatus according to any one of claims 7 to 9, comprising a detection unit for extracting sequence features of a sequence to be tested and then introducing the obtained sequence features into the quantitative prediction model.