WO2021129035A1 - Method for constructing a gene sequence synthesis difficulty analysis model and application thereof - Google Patents

Method for constructing a gene sequence synthesis difficulty analysis model and application thereof

Info

Publication number
WO2021129035A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
coverage area
regression
gene
repeat
Prior art date
Application number
PCT/CN2020/119562
Inventor
赵文妍
段广有
丁砚书
方其
张艳
葛毅
廖国娟
Original Assignee
苏州金唯智生物科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州金唯智生物科技有限公司
Publication of WO2021129035A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics

Definitions

  • This application relates to the field of biotechnology, in particular to a method for constructing a gene sequence synthesis difficulty analysis model and its application.
  • Gene synthesis refers to the use of biological methods to synthesize required genes in vitro. It can not only modify existing genes but also create genes that do not exist in nature, that is, "modified life" and "artificial life". Because gene synthesis technology has opened up a new direction for transforming organisms, any field connected with genes requires artificial gene synthesis. In the foreseeable future, gene synthesis will play a major role in life sciences, new energy, new materials, artificial life, nucleic acid vaccines, and biomedicine.
  • The technical problem to be solved by this application is to provide a method for constructing a gene sequence synthesis difficulty analysis model and its application.
  • The gene sequence synthesis difficulty analysis model built by this construction method can predict the synthesis difficulty of different gene sequences. Based on the predicted difficulty, customers can be given a more accurate gene synthesis cycle for their sequence orders, which also helps gene synthesis companies with overall planning and improves production efficiency.
  • A method for constructing a gene sequence synthesis difficulty analysis model, including:
  • taking a number of different gene sequences with known sequence synthesis difficulty as a modeling database;
  • performing sequence feature extraction on the gene sequences in the database;
  • establishing a quantitative prediction model from the extracted sequence features and the known sequence synthesis difficulty using a regression algorithm.
  • In the above construction method, the known sequence synthesis difficulty and the sequence synthesis difficulty to be analyzed use the same characterization parameter or measurement standard, for example, the synthesis cycle on the same synthesis platform.
  • In a specific embodiment, the sequence synthesis difficulty can be measured or characterized by the length of the synthesis cycle; that is, the different genes with known sequence synthesis difficulty are different genes with known synthesis cycles.
  • Further, the extracted sequence features include at least 3 of: sequence length, sequence GC content, maximum forward repeat coverage area size, ratio of the largest forward repeat to its repeat coverage area, ratio of the total forward repeat coverage area to the sequence length, maximum reverse repeat coverage area size, ratio of the largest reverse repeat to its repeat coverage area, ratio of the total reverse repeat coverage area to the sequence length, number of consecutively repeated bases, and number of polymers.
  • Preferably, the sequence features are sequence length, sequence GC content, maximum forward repeat coverage area size, ratio of the largest forward repeat to its repeat coverage area, ratio of the total forward repeat coverage area to the sequence length, maximum reverse repeat coverage area size, ratio of the largest reverse repeat to its repeat coverage area, ratio of the total reverse repeat coverage area to the sequence length, number of consecutively repeated bases, and number of polymers.
  • Preferably, the number of different gene sequences with known sequence synthesis difficulty that are used is ≥ 500.
  • Further, the regression algorithm includes Bayesian ridge regression (BayesianRidge), linear regression (LinearRegression), elastic net (ElasticNet), support vector regression (SVR), background gradient boosting regression (GBR), random forest regression (RandomForestRegressor), gradient boosting regression (GradientBoostingRegressor), or extreme random forest regression (ExtraTreesRegressor).
  • the construction method further includes: extracting the sequence features of the sequence to be tested, and then importing the obtained sequence features into the quantitative prediction model.
  • A method for predicting the gene synthesis cycle, which includes using the quantitative prediction model constructed by the above construction method.
  • a gene synthesis difficulty analysis device which includes:
  • a database unit which is used to obtain a number of different gene sequences with known sequence synthesis difficulty
  • a sequence feature extraction unit which is used to perform sequence feature extraction on the gene sequence in the database unit
  • a quantitative prediction model unit, which is used to establish a quantitative prediction model from the sequence features and the known sequence synthesis difficulty using a regression algorithm.
  • Further, the sequence feature extraction unit includes at least 3 of: a sequence length extraction subunit, a sequence GC content extraction subunit, a maximum forward repeat coverage area size extraction subunit, a subunit for extracting the ratio of the largest forward repeat to its repeat coverage area, a subunit for extracting the ratio of the total forward repeat coverage area to the sequence length, a maximum reverse repeat coverage area size extraction subunit, a subunit for extracting the ratio of the largest reverse repeat to its repeat coverage area, a subunit for extracting the ratio of the total reverse repeat coverage area to the sequence length, a consecutively repeated base count extraction subunit, and a polymer count extraction subunit.
  • Preferably, the sequence feature extraction unit includes all of the subunits listed above.
  • Further, the quantitative prediction model unit includes a regression algorithm subunit selected from the group consisting of a Bayesian ridge regression (BayesianRidge) subunit, a linear regression (LinearRegression) subunit, an elastic net (ElasticNet) subunit, a support vector regression (SVR) subunit, a background gradient boosting regression (GBR) subunit, a random forest regression (RandomForestRegressor) subunit, a gradient boosting regression (GradientBoostingRegressor) subunit, and an extreme random forest regression (ExtraTreesRegressor) subunit.
  • the gene synthesis difficulty analysis device further includes a detection unit for extracting sequence features of the sequence to be tested, and then importing the extracted sequence features into the quantitative prediction model.
  • 1. The method for constructing a gene sequence synthesis difficulty analysis model provided by this application includes: taking a number of different gene sequences with known sequence synthesis difficulty as a modeling database; performing sequence feature extraction on the gene sequences in the database; and establishing a quantitative prediction model from the extracted sequence features and the known sequence synthesis difficulty using a regression algorithm. The inventors found during production that the synthesis difficulty of a gene sequence to be synthesized could not be predicted, which made it hard to meet customers' requirements for the gene synthesis cycle; moreover, when a large number of sequences were waiting to be synthesized, effective overall planning was impossible, which lowered the efficiency of gene synthesis.
  • The inventors therefore found that a model built with a regression algorithm from the sequence features of gene sequences with known synthesis difficulty, together with that known difficulty, can accurately estimate the synthesis difficulty of a gene sequence to be synthesized and thereby predict the gene synthesis cycle.
  • 2. In the method for constructing a gene synthesis difficulty analysis model provided by this application, the extracted sequence features include at least three of: sequence length, sequence GC content, maximum forward repeat coverage area size, ratio of the largest forward repeat to its repeat coverage area, ratio of the total forward repeat coverage area to the sequence length, maximum reverse repeat coverage area size, ratio of the largest reverse repeat to its repeat coverage area, ratio of the total reverse repeat coverage area to the sequence length, number of consecutively repeated bases, and number of polymers. Long-term gene synthesis practice showed that these sequence features are closely related to the synthesis difficulty of a gene sequence. A model built with a regression algorithm from these features and the known sequence synthesis difficulty can accurately estimate the difficulty of gene synthesis.
  • 3. In the method for constructing a gene sequence synthesis difficulty analysis model provided by this application, the regression algorithm includes BayesianRidge, linear regression (LinearRegression), elastic net (ElasticNet), support vector regression (SVR), background gradient boosting regression (GBR), random forest regression (RandomForestRegressor), gradient boosting regression (GradientBoostingRegressor), or extreme random forest regression (ExtraTreesRegressor). Research found that a model built with these regression algorithms from the sequence features of gene sequences with known synthesis difficulty, together with that known difficulty, can accurately estimate the difficulty of gene synthesis and further predict the gene synthesis cycle.
  • Figure 1 is a schematic diagram of Sequence 1 in the calculation of the maximum forward repeat coverage area size in this application;
  • Figure 2 is a schematic diagram of Sequence 2 in the calculation of the maximum forward repeat coverage area size in this application;
  • Figure 3 is a schematic diagram of Sequence 3 in the calculation of the maximum forward repeat coverage area size in this application;
  • Figure 4 is a schematic diagram of Sequence 4 in the calculation of the ratio of the largest forward repeat to its repeat coverage area in this application;
  • Figure 5 is a schematic diagram of Sequence 5 in the calculation of the ratio of the total forward repeat coverage area to the sequence length in this application;
  • Figure 6 is a structural diagram of the gene synthesis difficulty analysis device of Embodiment 2 of this application.
  • In Figures 1-5, the thin solid line represents the sequence, A, B, and C represent forward repeat coverage areas, and D and E represent forward repeat sequences.
  • the sequence length refers to the total length of the sequence.
  • GC content refers to the percentage of the sum of the number of bases G and the number of bases C in the sequence to the total number of bases in the sequence.
  • The maximum forward repeat coverage area size refers to the length of the largest area among the areas covered by forward repeats of the sequence (each forward repeat ≥ 8 bp). If the interval between two forward repeat coverage areas (each forward repeat ≥ 8 bp) is less than 20 bp, the sum of the lengths of the two coverage areas is taken as the maximum forward repeat coverage area size.
  • In Sequence 1, shown in Figure 1, taking forward repeats as an example, the sequence includes only forward repeat coverage areas A, B, and C. The interval between A and B is < 20 bp, the interval between B and C is > 20 bp, and the sum of the lengths of A and B is greater than the length of C, so the maximum forward repeat coverage area size is the sum of the lengths of A and B.
  • In Sequence 2, shown in Figure 2, taking forward repeats as an example, the sequence includes only forward repeat coverage areas A, B, and C. The interval between A and B is > 20 bp, the interval between B and C is > 20 bp, and the length of C > the length of A > the length of B, so the maximum forward repeat coverage area size is the length of C.
  • In Sequence 3, shown in Figure 3, taking forward repeats as an example, the sequence includes only forward repeat coverage areas A, B, and C. The interval between A and B is > 20 bp, the interval between B and C is < 20 bp, and the length of C > the length of A > the length of B, so the maximum forward repeat coverage area size is the sum of the lengths of C and B.
  • The ratio of the largest forward repeat to its repeat coverage area is defined as follows: a repeat coverage area is composed of several repeats, and a forward repeat coverage area is composed of several forward repeats. The ratio is the sequence length of the largest forward repeat divided by the length of the repeat coverage area in which it lies.
  • In Sequence 4, shown in Figure 4, taking forward repeats as an example, the sequence includes only forward repeat coverage area A, which contains forward repeat D and forward repeat E. If the length of D > the length of E, the ratio of the largest forward repeat to its repeat coverage area = length of D / length of A.
  • The ratio of the total forward repeat coverage area to the sequence length refers to the sum of the lengths of all forward repeat coverage areas in the sequence divided by the sequence length. In Sequence 5, shown in Figure 5, the sequence includes only forward repeat coverage areas A, B, and C, so the ratio = (length of A + length of B + length of C) / sequence length.
  • The maximum reverse repeat coverage area size is calculated in the same way as the maximum forward repeat coverage area size, except that reverse repeats are used.
  • The ratio of the largest reverse repeat to its repeat coverage area is calculated in the same way as the ratio of the largest forward repeat to its repeat coverage area, except that reverse repeats are used.
  • The ratio of the total reverse repeat coverage area to the sequence length is calculated in the same way as the ratio of the total forward repeat coverage area to the sequence length, except that reverse repeats are used.
  • the number of consecutively repeated bases refers to the number of consecutively repeated bases A, T, C, or G in the sequence.
  • the number of polymers refers to the sum of the number of poly structures appearing in the sequence, such as polyA, polyD, etc.
  • Embodiment 1: Method for constructing a gene sequence synthesis difficulty analysis model
  • This embodiment provides a method for constructing a gene sequence synthesis difficulty analysis model, which includes the following steps:
  • (1) Take 500 different gene sequences with known synthesis cycles (provided by Jinweizhi Biotechnology Co., Ltd.) as the modeling database.
  • (2) Perform sequence feature extraction on the gene sequences in the above database. The extracted sequence features include sequence length, sequence GC content, maximum forward repeat coverage area size, ratio of the largest forward repeat to its repeat coverage area, ratio of the total forward repeat coverage area to the sequence length, maximum reverse repeat coverage area size, ratio of the largest reverse repeat to its repeat coverage area, ratio of the total reverse repeat coverage area to the sequence length, number of consecutively repeated bases, and number of polymers. The sequence feature data of the gene sequences in the above database is provided by Jinweizhi Biotechnology Co., Ltd.
  • (3) Use a regression algorithm to establish a quantitative prediction model from the sequence feature data obtained in step (2) and the known sequence difficulty. The regression algorithms include Bayesian ridge regression (BayesianRidge), linear regression (LinearRegression), elastic net (ElasticNet), support vector regression (SVR), background gradient boosting regression (GBR), random forest regression (RandomForestRegressor), gradient boosting regression (GradientBoostingRegressor), or extreme random forest regression (ExtraTreesRegressor). In this embodiment, random forest regression (RandomForestRegressor) is used. With R² as the evaluation index of the model, the final fit is R² = 0.9, indicating good predictive performance.
  • (4) Synthesis difficulty prediction of the sequence to be synthesized: extract the sequence features listed in step (2) from the sequence to be synthesized, then import them into the quantitative prediction model constructed in step (3) to calculate the synthesis difficulty of the sequence to be synthesized.
  • Embodiment 2: Prediction of the gene synthesis cycle
  • In this embodiment, the quantitative prediction model constructed in Embodiment 1 is used to predict the synthesis difficulty of the sequence to be synthesized, and the gene synthesis cycle of that sequence is obtained from the predicted synthesis difficulty.
  • Embodiment 3 Gene sequence synthesis difficulty analysis device
  • This embodiment provides a device for analyzing the difficulty of gene sequence synthesis.
  • the structure of the device is shown in FIG. 6, which includes:
  • a database unit, used to obtain a number of different gene sequences with known sequence synthesis difficulty;
  • a sequence feature extraction unit, used to extract the sequence features of the gene sequences in the database unit;
  • further, the sequence feature extraction unit includes: a sequence length extraction subunit, a sequence GC content extraction subunit, a maximum forward repeat coverage area size extraction subunit, a subunit for extracting the ratio of the largest forward repeat to its repeat coverage area, a subunit for extracting the ratio of the total forward repeat coverage area to the sequence length, a maximum reverse repeat coverage area size extraction subunit, a subunit for extracting the ratio of the largest reverse repeat to its repeat coverage area, a subunit for extracting the ratio of the total reverse repeat coverage area to the sequence length, a consecutively repeated base count extraction subunit, and a polymer count extraction subunit;
  • a quantitative prediction model unit, used to establish a quantitative prediction model from the extracted sequence features and the known sequence synthesis difficulty using a regression algorithm;
  • further, the quantitative prediction model unit includes a regression algorithm subunit selected from the group consisting of a Bayesian ridge regression (BayesianRidge) subunit, a linear regression (LinearRegression) subunit, an elastic net (ElasticNet) subunit, a support vector regression (SVR) subunit, a background gradient boosting regression (GBR) subunit, a random forest regression (RandomForestRegressor) subunit, a gradient boosting regression (GradientBoostingRegressor) subunit, and an extreme random forest regression (ExtraTreesRegressor) subunit.
  • it also includes a detection unit for extracting sequence features of the sequence to be tested, and then importing the obtained sequence features into the quantitative prediction model.

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

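A method for constructing a gene sequence synthesis difficulty analysis model and its application. The construction method includes: taking a number of different gene sequences with known sequence synthesis difficulty as a modeling database; performing sequence feature extraction on the gene sequences in the database; and establishing a quantitative prediction model from the sequence features and the known sequence synthesis difficulty using a regression algorithm. With the constructed quantitative prediction model, customers can be given a relatively accurate gene synthesis cycle for their sequence orders, which also helps gene synthesis companies with overall planning and improves production efficiency.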

Description

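Method for constructing a gene sequence synthesis difficulty analysis model and application thereof
Technical Field
This application relates to the field of biotechnology, and in particular to a method for constructing a gene sequence synthesis difficulty analysis model and its application.
Background
With the continuous development of computing, bioinformatics, gene sequencing, and related technologies, the artificial synthesis of whole genes and even genomes has become possible. Gene synthesis refers to techniques that use biological methods to synthesize required genes in vitro. It can not only modify existing genes but also create genes that do not exist in nature, that is, "modified life" and "artificial life". Because gene synthesis technology has opened up a new direction for transforming organisms, any field connected with genes requires artificial gene synthesis. In the foreseeable future, gene synthesis will play a major role in life sciences, new energy, new materials, artificial life, nucleic acid vaccines, and biomedicine.
At present, industrialized gene synthesis methods are available for fast, high-throughput gene synthesis, with the aim of meeting the growing demand from research institutes and companies. Existing industrialized gene synthesis methods comprise roughly seven modular steps: PCR amplification, ligation and transformation, picking single clones for shaking culture, PCR identification of the bacterial culture, plasmid extraction, Sanger sequencing, and PCR amplification of the correct clones, finally yielding PCR product fragments consistent with expectations. Because these methods involve many steps and have low throughput, the overall workflow takes more than 72 hours and is costly. To improve gene synthesis efficiency, Chinese patent document CN107760672A discloses an industrialized gene synthesis method based on next-generation sequencing technology that is fast, simple, and efficient.
As demand for gene synthesis grows, gene synthesis companies receive large numbers of gene sequence synthesis orders from different customers at the same time. The sequences to be synthesized vary widely and differ in difficulty, so the production cycle of gene sequence synthesis cannot be estimated. Even with a standardized industrialized gene synthesis method, customers cannot be given a production cycle. In addition, because the cycle of the sequences to be synthesized is uncertain, effective overall planning is impossible, which lowers the efficiency of gene synthesis. However, there have been no reports on analyzing the synthesis difficulty of different gene sequences.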
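Summary of the Invention
Therefore, the technical problem to be solved by this application is to provide a method for constructing a gene sequence synthesis difficulty analysis model and its application. The model built by this construction method can predict the synthesis difficulty of different gene sequences. Based on the predicted difficulty, customers can be given a relatively accurate gene synthesis cycle for their sequence orders, which also helps gene synthesis companies with overall planning and improves production efficiency.
To solve the above technical problem, this application provides the following technical solutions:
A method for constructing a gene sequence synthesis difficulty analysis model, including:
taking a number of different gene sequences with known sequence synthesis difficulty as a modeling database;
performing sequence feature extraction on the gene sequences in the database;
establishing a quantitative prediction model from the extracted sequence features and the known sequence synthesis difficulty using a regression algorithm.
In the above construction method, the known sequence synthesis difficulty and the sequence synthesis difficulty to be analyzed use the same characterization parameter or measurement standard, for example, the synthesis cycle on the same synthesis platform.
In a specific embodiment, the sequence synthesis difficulty can be measured or characterized by the length of the synthesis cycle; that is, the different genes with known sequence synthesis difficulty are different genes with known synthesis cycles.
Further, the extracted sequence features include at least 3 of: sequence length, sequence GC content, maximum forward repeat coverage area size, ratio of the largest forward repeat to its repeat coverage area, ratio of the total forward repeat coverage area to the sequence length, maximum reverse repeat coverage area size, ratio of the largest reverse repeat to its repeat coverage area, ratio of the total reverse repeat coverage area to the sequence length, number of consecutively repeated bases, and number of polymers.
Preferably, the sequence features are sequence length, sequence GC content, maximum forward repeat coverage area size, ratio of the largest forward repeat to its repeat coverage area, ratio of the total forward repeat coverage area to the sequence length, maximum reverse repeat coverage area size, ratio of the largest reverse repeat to its repeat coverage area, ratio of the total reverse repeat coverage area to the sequence length, number of consecutively repeated bases, and number of polymers.
Preferably, the number of different gene sequences with known sequence synthesis difficulty that are used is ≥ 500.
Further, the regression algorithm includes Bayesian ridge regression (BayesianRidge), linear regression (LinearRegression), elastic net (ElasticNet), support vector regression (SVR), background gradient boosting regression (GBR), random forest regression (RandomForestRegressor), gradient boosting regression (GradientBoostingRegressor), or extreme random forest regression (ExtraTreesRegressor).
Further, the construction method also includes: extracting the sequence features of a sequence to be tested, and then importing the obtained sequence features into the quantitative prediction model.
A quantitative prediction model constructed by the above construction method.
A gene synthesis cycle prediction method, which includes using the quantitative prediction model constructed by the above construction method.
A gene synthesis difficulty analysis device, which includes:
a database unit, used to obtain a number of different gene sequences with known sequence synthesis difficulty;
a sequence feature extraction unit, used to perform sequence feature extraction on the gene sequences in the database unit;
a quantitative prediction model unit, used to establish a quantitative prediction model from the sequence features and the known sequence synthesis difficulty using a regression algorithm.
Further, the sequence feature extraction unit includes at least 3 of: a sequence length extraction subunit, a sequence GC content extraction subunit, a maximum forward repeat coverage area size extraction subunit, a subunit for extracting the ratio of the largest forward repeat to its repeat coverage area, a subunit for extracting the ratio of the total forward repeat coverage area to the sequence length, a maximum reverse repeat coverage area size extraction subunit, a subunit for extracting the ratio of the largest reverse repeat to its repeat coverage area, a subunit for extracting the ratio of the total reverse repeat coverage area to the sequence length, a consecutively repeated base count extraction subunit, and a polymer count extraction subunit.
Preferably, the sequence feature extraction unit includes: a sequence length extraction subunit, a sequence GC content extraction subunit, a maximum forward repeat coverage area size extraction subunit, a subunit for extracting the ratio of the largest forward repeat to its repeat coverage area, a subunit for extracting the ratio of the total forward repeat coverage area to the sequence length, a maximum reverse repeat coverage area size extraction subunit, a subunit for extracting the ratio of the largest reverse repeat to its repeat coverage area, a subunit for extracting the ratio of the total reverse repeat coverage area to the sequence length, a consecutively repeated base count extraction subunit, and a polymer count extraction subunit.
Further, the quantitative prediction model unit includes a regression algorithm subunit selected from the group consisting of a Bayesian ridge regression (BayesianRidge) subunit, a linear regression (LinearRegression) subunit, an elastic net (ElasticNet) subunit, a support vector regression (SVR) subunit, a background gradient boosting regression (GBR) subunit, a random forest regression (RandomForestRegressor) subunit, a gradient boosting regression (GradientBoostingRegressor) subunit, and an extreme random forest regression (ExtraTreesRegressor) subunit.
Further, the gene synthesis difficulty analysis device also includes a detection unit, used to extract the sequence features of a sequence to be tested and then import the extracted sequence features into the quantitative prediction model.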
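The technical solution of this application has the following advantages:
1. The method for constructing a gene sequence synthesis difficulty analysis model provided by this application includes: taking a number of different gene sequences with known sequence synthesis difficulty as a modeling database; performing sequence feature extraction on the gene sequences in the database; and establishing a quantitative prediction model from the extracted sequence features and the known sequence synthesis difficulty using a regression algorithm. The inventors found during production that the synthesis difficulty of a gene sequence to be synthesized could not be predicted, which made it hard to meet customers' requirements for the gene synthesis cycle; moreover, when a large number of sequences were waiting to be synthesized, effective overall planning was impossible, which lowered the efficiency of gene synthesis. The inventors therefore found that a model built with a regression algorithm from the sequence features of gene sequences with known synthesis difficulty, together with that known difficulty, can accurately estimate the synthesis difficulty of a gene sequence to be synthesized and thereby predict the gene synthesis cycle.
2. In the method for constructing a gene synthesis difficulty analysis model provided by this application, the extracted sequence features include at least three of: sequence length, sequence GC content, maximum forward repeat coverage area size, ratio of the largest forward repeat to its repeat coverage area, ratio of the total forward repeat coverage area to the sequence length, maximum reverse repeat coverage area size, ratio of the largest reverse repeat to its repeat coverage area, ratio of the total reverse repeat coverage area to the sequence length, number of consecutively repeated bases, and number of polymers. Long-term gene synthesis practice showed that these sequence features are closely related to the synthesis difficulty of a gene sequence. A model built with a regression algorithm from these features and the known sequence synthesis difficulty can accurately estimate the difficulty of gene synthesis.
3. In the method for constructing a gene sequence synthesis difficulty analysis model provided by this application, the regression algorithm includes BayesianRidge, linear regression (LinearRegression), elastic net (ElasticNet), support vector regression (SVR), background gradient boosting regression (GBR), random forest regression (RandomForestRegressor), gradient boosting regression (GradientBoostingRegressor), or extreme random forest regression (ExtraTreesRegressor). Research found that a model built with these regression algorithms from the sequence features of gene sequences with known synthesis difficulty, together with that known difficulty, can accurately estimate the difficulty of gene synthesis and further predict the gene synthesis cycle.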
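Brief Description of the Drawings
To explain the specific embodiments of this application or the technical solutions in the prior art more clearly, the drawings needed for describing them are briefly introduced below. The drawings described below are some embodiments of this application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is a schematic diagram of Sequence 1 in the calculation of the maximum forward repeat coverage area size in this application;
Figure 2 is a schematic diagram of Sequence 2 in the calculation of the maximum forward repeat coverage area size in this application;
Figure 3 is a schematic diagram of Sequence 3 in the calculation of the maximum forward repeat coverage area size in this application;
Figure 4 is a schematic diagram of Sequence 4 in the calculation of the ratio of the largest forward repeat to its repeat coverage area in this application;
Figure 5 is a schematic diagram of Sequence 5 in the calculation of the ratio of the total forward repeat coverage area to the sequence length in this application;
Figure 6 is a structural diagram of the gene synthesis difficulty analysis device of Embodiment 2 of this application.
In Figures 1-5, the thin solid line represents the sequence, A, B, and C represent forward repeat coverage areas, and D and E represent forward repeat sequences.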
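Detailed Description
In this application, the terms for the sequence features are explained as follows:
Sequence length refers to the total length of the sequence.
Sequence GC content (GC%) refers to the sum of the number of G bases and the number of C bases in the sequence as a percentage of the total number of bases in the sequence.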
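The two simplest features, sequence length and GC content, can be sketched as follows. This is a minimal illustration rather than the applicant's implementation; the function names `sequence_length` and `gc_content` are hypothetical, and the sequence is assumed to be a plain string of A, T, C, and G.

```python
def sequence_length(seq: str) -> int:
    """Total number of bases in the sequence."""
    return len(seq)


def gc_content(seq: str) -> float:
    """GC% = (count of G + count of C) / total bases * 100."""
    if not seq:
        raise ValueError("sequence must not be empty")
    upper = seq.upper()  # tolerate lowercase input (an assumption)
    return 100.0 * (upper.count("G") + upper.count("C")) / len(upper)
```

For instance, a four-base sequence with one G and one C has a GC content of 50%.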
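The maximum forward repeat coverage area size refers to the length of the largest area among the areas covered by forward repeats of the sequence (each forward repeat ≥ 8 bp). If the interval between two forward repeat coverage areas (each forward repeat ≥ 8 bp) is less than 20 bp, the sum of the lengths of the two coverage areas is taken as the maximum forward repeat coverage area size.
In Sequence 1, shown in Figure 1, taking forward repeats as an example, the sequence includes only forward repeat coverage areas A, B, and C. The interval between A and B is < 20 bp, the interval between B and C is > 20 bp, and the sum of the lengths of A and B is greater than the length of C, so the maximum forward repeat coverage area size is the sum of the lengths of A and B.
In Sequence 2, shown in Figure 2, taking forward repeats as an example, the sequence includes only forward repeat coverage areas A, B, and C. The interval between A and B is > 20 bp, the interval between B and C is > 20 bp, and the length of C > the length of A > the length of B, so the maximum forward repeat coverage area size is the length of C.
In Sequence 3, shown in Figure 3, taking forward repeats as an example, the sequence includes only forward repeat coverage areas A, B, and C. The interval between A and B is > 20 bp, the interval between B and C is < 20 bp, and the length of C > the length of A > the length of B, so the maximum forward repeat coverage area size is the sum of the lengths of C and B.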
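The grouping rule illustrated by Sequences 1 to 3 can be sketched in a few lines. The sketch assumes the repeat coverage areas have already been located and are given as non-overlapping half-open `(start, end)` intervals; how the ≥ 8 bp repeats are detected is not shown. The function name `max_repeat_coverage_size` is hypothetical. Two points are assumptions not settled by the text: a chain of more than two areas separated by gaps below 20 bp is summed as one group, and a gap of exactly 20 bp is treated as not merging.

```python
def max_repeat_coverage_size(areas, max_gap=20):
    """Largest coverage size after summing areas whose gap is < max_gap bp.

    areas: iterable of non-overlapping (start, end) half-open intervals.
    """
    best = 0
    group_total = 0
    prev_end = None
    for start, end in sorted(areas):
        length = end - start
        if prev_end is not None and start - prev_end < max_gap:
            group_total += length  # close neighbour: add to the current group
        else:
            group_total = length   # far apart: start a new group
        best = max(best, group_total)
        prev_end = end
    return best


# Illustrative coordinates mimicking Sequences 1-3 (made-up numbers).
sequence_one = [(100, 130), (140, 160), (200, 240)]  # A-B gap 10, B-C gap 40
sequence_two = [(0, 30), (60, 80), (120, 170)]       # both gaps > 20
sequence_three = [(0, 30), (60, 80), (90, 140)]      # A-B gap 30, B-C gap 10
```

With these coordinates, Sequence 1 yields the combined length of A and B, Sequence 2 the length of C alone, and Sequence 3 the combined length of B and C, matching the three cases described above. The same function applies to reverse repeats if their coverage areas are supplied instead.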
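The ratio of the largest forward repeat to its repeat coverage area is defined as follows: a repeat coverage area is composed of several repeats, and a forward repeat coverage area is composed of several forward repeats. The ratio is the sequence length of the largest forward repeat divided by the length of the repeat coverage area in which it lies.
In Sequence 4, shown in Figure 4, taking forward repeats as an example, the sequence includes only forward repeat coverage area A, which contains forward repeat D and forward repeat E. If the length of D > the length of E, the ratio of the largest forward repeat to its repeat coverage area = length of forward repeat D / length of forward repeat coverage area A.
The ratio of the total forward repeat coverage area to the sequence length refers to the sum of the lengths of all forward repeat coverage areas in the sequence divided by the sequence length. In Sequence 5, shown in Figure 5, taking forward repeats as an example, the sequence includes only forward repeat coverage areas A, B, and C, so the ratio of the total forward repeat coverage area to the sequence length = (length of A + length of B + length of C) / sequence length.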
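Both ratios reduce to a single division once the repeat lengths and coverage area lengths are known. The following sketch assumes those lengths are supplied by an upstream repeat finder that is not shown; the function names are hypothetical.

```python
def largest_repeat_ratio(repeat_lengths, area_length):
    """Length of the largest repeat divided by the length of its coverage area."""
    if not repeat_lengths or area_length <= 0:
        raise ValueError("need at least one repeat and a positive area length")
    return max(repeat_lengths) / area_length


def total_coverage_ratio(area_lengths, sequence_length):
    """Sum of all repeat coverage area lengths divided by the sequence length."""
    if sequence_length <= 0:
        raise ValueError("sequence length must be positive")
    return sum(area_lengths) / sequence_length
```

In the Sequence 4 case, `repeat_lengths` would hold the lengths of D and E and `area_length` the length of A; in the Sequence 5 case, `area_lengths` would hold the lengths of A, B, and C.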
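The maximum reverse repeat coverage area size is calculated in the same way as the maximum forward repeat coverage area size, except that reverse repeats are used.
The ratio of the largest reverse repeat to its repeat coverage area is calculated in the same way as the ratio of the largest forward repeat to its repeat coverage area, except that reverse repeats are used.
The ratio of the total reverse repeat coverage area to the sequence length is calculated in the same way as the ratio of the total forward repeat coverage area to the sequence length, except that reverse repeats are used.
The number of consecutively repeated bases refers to the number of consecutive repeats of any one of the bases A, T, C, or G in the sequence.
The number of polymers refers to the total number of poly structures appearing in the sequence, such as polyA, polyD, etc.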
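One possible reading of these last two features is sketched below. The text does not fix the details, so two assumptions are made and should be treated as such: the "number of consecutively repeated bases" is taken to be the length of the longest single-base run, and a "poly structure" is taken to be any single-base run of at least `min_run` bases, where the default of 6 is an arbitrary illustrative threshold. The function names are hypothetical.

```python
from itertools import groupby


def longest_base_run(seq: str) -> int:
    """Length of the longest run of one repeated base (assumed interpretation)."""
    return max((len(list(run)) for _, run in groupby(seq.upper())), default=0)


def count_poly_tracts(seq: str, min_run: int = 6) -> int:
    """Number of single-base runs with at least min_run bases.

    min_run is an illustrative threshold, not a value taken from the source.
    """
    return sum(1 for _, run in groupby(seq.upper()) if len(list(run)) >= min_run)
```

`itertools.groupby` splits the sequence into maximal runs of identical characters, so both features come from a single pass over the sequence.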
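Embodiment 1: Method for constructing a gene sequence synthesis difficulty analysis model
This embodiment provides a method for constructing a gene sequence synthesis difficulty analysis model, which includes the following steps:
(1) Take 500 different gene sequences with known synthesis cycles (provided by Jinweizhi Biotechnology Co., Ltd., 金唯智生物科技有限公司) as the modeling database.
(2) Perform sequence feature extraction on the gene sequences in the above database. The extracted sequence features include sequence length, sequence GC content, maximum forward repeat coverage area size, ratio of the largest forward repeat to its repeat coverage area, ratio of the total forward repeat coverage area to the sequence length, maximum reverse repeat coverage area size, ratio of the largest reverse repeat to its repeat coverage area, ratio of the total reverse repeat coverage area to the sequence length, number of consecutively repeated bases, and number of polymers. The sequence feature data of the gene sequences in the above database is provided by Jinweizhi Biotechnology Co., Ltd.
(3) Use a regression algorithm to establish a quantitative prediction model from the sequence feature data obtained in step (2) and the known sequence difficulty. The regression algorithms include Bayesian ridge regression (BayesianRidge), linear regression (LinearRegression), elastic net (ElasticNet), support vector regression (SVR), background gradient boosting regression (GBR), random forest regression (RandomForestRegressor), gradient boosting regression (GradientBoostingRegressor), or extreme random forest regression (ExtraTreesRegressor). In this embodiment, random forest regression (RandomForestRegressor) is used to establish the quantitative prediction model. With R² as the evaluation index of the model, the final fit is R² = 0.9, indicating that the quantitative prediction model constructed in this embodiment has good predictive performance.
(4) Synthesis difficulty prediction of the sequence to be synthesized: extract the sequence features listed in step (2) from the sequence to be synthesized, then import them into the quantitative prediction model constructed in step (3) to calculate the synthesis difficulty of the sequence to be synthesized.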
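Steps (3) and (4) amount to fitting a regressor on feature rows with known synthesis cycles, scoring it with R², and then predicting for a new sequence. The embodiment uses RandomForestRegressor; the sketch below deliberately swaps in ordinary least-squares linear regression, one of the other listed algorithms, so that it runs on the Python standard library alone. All names are hypothetical, only two features (sequence length and GC%) are used, and the training values are invented for illustration.

```python
def fit_linear_regression(feature_rows, targets):
    """Solve the normal equations (X^T X) w = X^T y by Gauss-Jordan elimination.

    Returns [intercept, w1, w2, ...].
    """
    rows = [[1.0] + [float(v) for v in row] for row in feature_rows]
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * t for r, t in zip(rows, targets))]
        for i in range(n)
    ]
    for col in range(n):
        pivot_row = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot_row][col]) < 1e-12:
            raise ValueError("features are collinear; cannot solve")
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]
        pivot = aug[col][col]
        aug[col] = [v / pivot for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def predict(weights, features):
    """Predicted synthesis cycle for one feature row."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], features))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Invented training data: (sequence length, GC%) -> synthesis cycle in days.
train_features = [(500, 40), (1000, 50), (1500, 60), (2000, 45)]
train_cycles = [11.0, 17.0, 23.0, 26.5]

weights = fit_linear_regression(train_features, train_cycles)
fitted = [predict(weights, row) for row in train_features]
score = r_squared(train_cycles, fitted)
new_cycle = predict(weights, (1200, 55))  # step (4): sequence to be synthesized
```

In practice the R² would be computed on held-out sequences rather than on the training rows, and a nonlinear model such as the random forest named in the embodiment may capture interactions between features that a linear fit cannot.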
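Embodiment 2: Prediction of the gene synthesis cycle
In this embodiment, the quantitative prediction model constructed in Embodiment 1 is used to predict the synthesis difficulty of the sequence to be synthesized, and the gene synthesis cycle of that sequence is obtained from the predicted synthesis difficulty.
Embodiment 3: Gene sequence synthesis difficulty analysis device
This embodiment provides a gene sequence synthesis difficulty analysis device. The structure of the device is shown in Figure 6, and it includes:
a database unit, used to obtain a number of different gene sequences with known sequence synthesis difficulty;
a sequence feature extraction unit, used to perform sequence feature extraction on the gene sequences in the database unit;
further, the sequence feature extraction unit includes: a sequence length extraction subunit, a sequence GC content extraction subunit, a maximum forward repeat coverage area size extraction subunit, a subunit for extracting the ratio of the largest forward repeat to its repeat coverage area, a subunit for extracting the ratio of the total forward repeat coverage area to the sequence length, a maximum reverse repeat coverage area size extraction subunit, a subunit for extracting the ratio of the largest reverse repeat to its repeat coverage area, a subunit for extracting the ratio of the total reverse repeat coverage area to the sequence length, a consecutively repeated base count extraction subunit, and a polymer count extraction subunit;
a quantitative prediction model unit, used to establish a quantitative prediction model from the extracted sequence features and the known sequence synthesis difficulty using a regression algorithm;
further, the quantitative prediction model unit includes a regression algorithm subunit selected from the group consisting of a Bayesian ridge regression (BayesianRidge) subunit, a linear regression (LinearRegression) subunit, an elastic net (ElasticNet) subunit, a support vector regression (SVR) subunit, a background gradient boosting regression (GBR) subunit, a random forest regression (RandomForestRegressor) subunit, a gradient boosting regression (GradientBoostingRegressor) subunit, and an extreme random forest regression (ExtraTreesRegressor) subunit.
Further, the device also includes a detection unit, used to extract the sequence features of a sequence to be tested and then import the obtained sequence features into the quantitative prediction model.
Obviously, the above embodiments are merely examples given for clarity of description and do not limit the implementations. A person of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to list all implementations exhaustively here. Obvious changes or variations derived from the above remain within the scope of protection of this application.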

Claims (10)

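  1. A method for constructing a gene sequence synthesis difficulty analysis model, comprising:
    taking a number of different gene sequences with known sequence synthesis difficulty as a modeling database;
    performing sequence feature extraction on the gene sequences in the database;
    establishing a quantitative prediction model from the extracted sequence features and the known sequence synthesis difficulty using a regression algorithm.
  2. The construction method according to claim 1, wherein the extracted sequence features include at least 3 of: sequence length, sequence GC content, maximum forward repeat coverage area size, ratio of the largest forward repeat to its repeat coverage area, ratio of the total forward repeat coverage area to the sequence length, maximum reverse repeat coverage area size, ratio of the largest reverse repeat to its repeat coverage area, ratio of the total reverse repeat coverage area to the sequence length, number of consecutively repeated bases, and number of polymers.
  3. The construction method according to claim 1 or 2, wherein the regression algorithm includes Bayesian ridge regression, linear regression, elastic net, support vector regression, background gradient boosting regression, random forest regression, gradient boosting regression, or extreme random forest regression.
  4. The construction method according to any one of claims 1-3, further comprising: extracting the sequence features of a sequence to be tested, and then importing the obtained sequence features into the quantitative prediction model.
  5. Use of a quantitative prediction model constructed by the construction method according to any one of claims 1-4 in predicting a gene synthesis cycle.
  6. A gene synthesis cycle prediction method, comprising using a quantitative prediction model constructed by the construction method according to any one of claims 1-4.
  7. A gene sequence difficulty analysis device, comprising:
    a database unit, used to obtain a number of different gene sequences with known sequence synthesis cycles;
    a sequence feature extraction unit, used to perform sequence feature extraction on the gene sequences in the database unit;
    a quantitative prediction model unit, used to establish a quantitative prediction model from the sequence features and the known sequence synthesis cycles using a regression algorithm.
  8. The device according to claim 7, wherein the sequence feature extraction unit includes at least 3 of: a sequence length extraction subunit, a sequence GC content extraction subunit, a maximum forward repeat coverage area size extraction subunit, a subunit for extracting the ratio of the largest forward repeat to its repeat coverage area, a subunit for extracting the ratio of the total forward repeat coverage area to the sequence length, a maximum reverse repeat coverage area size extraction subunit, a subunit for extracting the ratio of the largest reverse repeat to its repeat coverage area, a subunit for extracting the ratio of the total reverse repeat coverage area to the sequence length, a consecutively repeated base count extraction subunit, and a polymer count extraction subunit.
  9. The device according to claim 7 or 8, wherein the quantitative prediction model unit includes a regression algorithm subunit selected from the group consisting of a Bayesian ridge regression subunit, a linear regression subunit, an elastic net subunit, a support vector regression subunit, a background gradient boosting regression subunit, a random forest regression subunit, a gradient boosting regression subunit, and an extreme random forest regression subunit.
  10. The device according to any one of claims 7-9, further comprising a detection unit, used to extract the sequence features of a sequence to be tested and then import the extracted sequence features into the quantitative prediction model.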
PCT/CN2020/119562 2019-12-23 2020-09-30 Method for constructing a gene sequence synthesis difficulty analysis model and application thereof WO2021129035A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911337248.6A CN111192629A (zh) 2019-12-23 2019-12-23 Method for constructing a gene sequence difficulty analysis model and application thereof
CN201911337248.6 2019-12-23

Publications (1)

Publication Number Publication Date
WO2021129035A1 true WO2021129035A1 (zh) 2021-07-01

Family

ID=70707430

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/119562 WO2021129035A1 (zh) 2019-12-23 2020-09-30 Method for constructing a gene sequence synthesis difficulty analysis model and application thereof

Country Status (2)

Country Link
CN (1) CN111192629A (zh)
WO (1) WO2021129035A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
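CN111192629A (zh) * 2019-12-23 2020-05-22 苏州金唯智生物科技有限公司 Method for constructing a gene sequence difficulty analysis model and application thereof
CN116705176B (zh) * 2023-06-27 2024-08-20 苏州君跻基因科技有限公司 Method, system, and device for analyzing the synthesis difficulty of gene synthesis sequences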

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
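CN106599615A (zh) * 2016-11-30 2017-04-26 广东顺德中山大学卡内基梅隆大学国际联合研究院 Sequence feature analysis method for predicting miRNA target genes
CN108133122A (zh) * 2016-12-01 2018-06-08 深圳华大基因股份有限公司 Gene clustering method, and metagenome assembly method and device based thereon
CN108614955A (zh) * 2018-05-04 2018-10-02 吉林大学 lncRNA identification method based on sequence composition, structural information, and physicochemical features
CN110517731A (zh) * 2019-10-23 2019-11-29 上海思路迪医学检验所有限公司 Data processing method and system for gene testing quality monitoring
CN111192629A (zh) * 2019-12-23 2020-05-22 苏州金唯智生物科技有限公司 Method for constructing a gene sequence difficulty analysis model and application thereof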

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109887546B (zh) * 2019-01-15 2019-12-27 明码(上海)生物科技有限公司 System and method for detecting single-gene or multi-gene copy number based on next-generation sequencing

Also Published As

Publication number Publication date
CN111192629A (zh) 2020-05-22

Similar Documents

Publication Publication Date Title
WO2021129035A1 (zh) 一种基因序列合成难度分析模型的构建方法及其应用
Schiebinger et al. Reconstruction of developmental landscapes by optimal-transport analysis of single-cell gene expression sheds light on cellular reprogramming
CN112908414B (zh) 一种大规模单细胞分型方法、系统及存储介质
Galitsyna et al. Single-cell Hi-C data analysis: safety in numbers
CN104504288A (zh) 基于多向支持向量聚类的非线性多阶段间歇过程软测量方法
CN107506614A (zh) 一种基于Illumina的转录组测序数据和PeakCalling方法的细菌ncRNA预测方法
Chadly et al. Reconstructing cell histories in space with image-readable base editor recording
CN106471509A (zh) 用于组装来自一个或多个生物体的染色体段的方法、设备和计算机程序
Tuggle et al. Methods for transcriptomic analyses of the porcine host immune response: application to Salmonella infection using microarrays
Meyer et al. Modeling methylation patterns with long read sequencing data
Lasri et al. Benchmarking imputation methods for network inference using a novel method of synthetic scRNA-seq data generation
Gulati et al. Profiling cell identity and tissue architecture with single-cell and spatial transcriptomics
Cordes et al. Multi-omic analyses in immune cell development with lessons learned from T cell development
Al-Ghazawi Differential Equation Modeling of Cell Population Dynamics in Skeletal Muscle Regeneration from Single-Cell Transcriptomic Data
Wang et al. Single-cell phylodynamic inference of tissue development and tumor evolution with scPhyloX
Moussa Computational cell cycle analysis of single cell RNA-Seq data
Som Bioinformatics strategies for stem cell research
Lasri Doukkali et al. Benchmarking imputation methods for network inference using a novel method of synthetic scRNA-seq data generation
Li et al. A Network Propagation Based Approach for Measuring Cell-Cell Similarity
Arnob et al. Advances agglomerative clustering technique for phylogenetic classification
Cole Machine learning methods for next generation sequencing data: applications to MLL-AF4 leukemia and demographic inference
Hofmann 3D organization of eukaryotic and prokaryotic genomes
Menon et al. COMPARISON OF COMPUTATIONAL TOOLS FOR DIFFERENTIAL GENE EXPRESSION ANALYSIS OF RNA SEQUENCING AND SINGLE-CELL RNA SEQUENCING DATA
Saeys dyngen is a multi-modal simulator for spearheading future single-cell omics analyses
Hia et al. A Differential Evolution-based Pseudotime Estimation Method for Single-cell Data.

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20906143

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20906143

Country of ref document: EP

Kind code of ref document: A1