KR20210037611A - Codon optimization

Info

Publication number: KR20210037611A
Application number: KR1020207035094A
Authority: KR
Inventors: 롱 판
Original assignee: 난징 진스크립트 바이오테크 컴퍼니 리미티드
Priority date: 2018-07-30
Filing date: 2019-07-30
Publication date: 2021-04-06
Also published as: US20210366574A1; EP3830830A4; WO2020024917A1; SG11202011455SA; CN112513989B; CN112513989A; TW202008379A; JP2021532439A; EP3830830A1; TWI802728B

Abstract

An exemplary computer-implemented method for optimizing a nucleic acid sequence for expression of a protein in a host comprises: a) receiving an initial population set, wherein the initial population set comprises a plurality of initial candidate nucleic acid sequences capable of expressing the protein (106); and b) performing, based on the initial population set, optimization of a harmony index, a codon context index, and an outlier index using a computer-assisted NSGA-III algorithm or a variant thereof, thereby obtaining a plurality of optimized nucleic acid sequences capable of expressing the protein (108).

Description

Codon optimization

Submission of sequence listing on ASCII text file

The content of the following submission on ASCII text file is incorporated herein by reference in its entirety: a computer readable form (CRF) of the sequence listing (file name: 759892000440SEQLIST.TXT, date recorded: July 25, 2018, size: 4 KB).

Field of the invention

The present disclosure relates generally to optimization techniques, and more specifically to systems and methods for optimizing sequences (e.g., nucleic acid sequences) for protein expression in a host.

Codon degeneracy refers to the redundancy of the genetic code, which manifests as the phenomenon that an amino acid can be specified by different synonymous codons. Notably, these synonymous codons have been found to be used at unequal frequencies in most sequenced genomes. This phenomenon is called codon usage bias.

Because biomedical and biotechnological research and industrial production require high-quality proteins with correct folding and modifications, methods that explore and summarize the potentially beneficial rules and patterns reflecting the codon usage bias of highly expressed genes are essential for improving protein expression levels. However, protein expression is a multi-step process that involves regulation at the levels of transcription, mRNA turnover, translation, and the post-translational modifications that can form a stable product. Even a single synonymous codon substitution can increase the expression of a transgene by more than 1,000-fold. Codon optimization is therefore used to prepare synthetic genes for optimal expression in a recombinant host.

Brief summary

Provided herein are systems and methods for improved codon optimization that use a multi-objective optimization algorithm to consider, and also balance, multiple factors. According to some embodiments, codon optimization is based on, among other things, three objectives: (i) how to initially allocate the counts of synonymous codons for a particular amino acid; (ii) how to place each synonymous codon at its most suitable position; and (iii) how to reduce unfavorable but accidentally generated subsequences and/or motifs. In some embodiments, these three objectives are quantified as a harmony index, a codon context index, and an outlier index. During optimization, the objectives are considered using a multi-objective algorithm such as the Non-dominated Sorting Genetic Algorithm III (NSGA-III) or a variant thereof. Specifically, the objectives for a given candidate nucleic acid sequence can be calculated with reference to known properties of highly expressed genes. In some embodiments, various known unfavorable motifs and/or features (e.g., as identified in the literature) are removed from one or more optimized sequences before gene synthesis and protein expression.

Accordingly, the present invention provides a systematic method in which all or most of the parameters and factors affecting protein expression are preferably taken into consideration to improve and optimize a nucleic acid sequence. These parameters and factors include, but are not limited to, codon harmony, codon usage (e.g., synonymous codon distribution), codon context index, cis-acting mRNA destabilizing motifs, RNase splicing sites, GC content, ribosome binding sites (RBS), mRNA secondary structure of the gene (e.g., mRNA free energy), and repeat elements. The aim is to boost protein expression of the gene in an expression system, such as in expression host cells, including both eukaryotic and prokaryotic cells (e.g., mammalian, insect, yeast, bacterial, and algal cells), and in cell-free expression systems.

In some embodiments, provided is a computer-implemented method for optimizing a nucleic acid sequence for expression of a protein in a host, comprising: a) receiving an initial population set, wherein the initial population set comprises a plurality of initial candidate nucleic acid sequences capable of expressing the protein; and b) performing, based on the initial population set, optimization of a harmony index, a codon context index, and an outlier index using a computer-assisted NSGA-III algorithm or a variant thereof, thereby obtaining a plurality of optimized nucleic acid sequences capable of expressing the protein, wherein the harmony index of a candidate nucleic acid sequence indicates the consistency of the usage frequency distribution of synonymous codons between a plurality of highly expressed genes and the candidate nucleic acid sequence, wherein the codon context index of the candidate nucleic acid sequence is a measure of the placement of synonymous codons at suitable positions, and wherein the outlier index of the candidate nucleic acid sequence is a measure of the negative effect of a plurality of predetermined sequence features on the candidate nucleic acid sequence.

In some embodiments, the method further comprises providing an output indicative of at least one optimized nucleic acid sequence of the plurality of optimized nucleic acid sequences.

In some embodiments, receiving the initial population set comprises: receiving a protein sequence; and generating the initial population set based on the received protein sequence.

In some embodiments, receiving the initial population set comprises: receiving a nucleic acid sequence; translating the received nucleic acid sequence into a protein sequence; and generating the initial population set based on the protein sequence.

In some embodiments, the initial population set is of a predetermined size.

In some embodiments, the initial population set comprises binary representations of the plurality of initial candidate nucleic acid sequences.

In some embodiments, performing optimization of the harmony index, the codon context index, and the outlier index comprises: maximizing the harmony index; maximizing the codon context index; and minimizing the outlier index.

In some embodiments, performing optimization of the harmony index, the codon context index, and the outlier index comprises: for each initial candidate nucleic acid sequence of the initial population set, calculating a respective harmony index value, a respective codon context index value, and a respective outlier index value; assigning, based on the calculating, a plurality of fitness values corresponding to the plurality of initial candidate nucleic acid sequences; sorting the plurality of initial candidate nucleic acid sequences based on the plurality of fitness values; and including a subset of the sorted plurality of initial candidate nucleic acid sequences in a subsequent population set. In some embodiments, the plurality of fitness values comprises the harmony index, the codon context index, and the outlier index of a candidate nucleic acid sequence.

In some embodiments, the method further comprises: generating an offspring population based on the initial population; and including the offspring population in the subsequent population set.

In some embodiments, the offspring population is generated via binary tournament selection, crossover/recombination, mutation, or any combination thereof.
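As an illustration only, offspring generation of this kind might be sketched in Python as follows. The sketch operates directly on codons rather than on the binary representation and simulated binary crossover mentioned elsewhere in this description, so that every child still encodes the same protein; that substitution, the function names, and the `levels` input (lower meaning better) are assumptions made for the example and are not taken from the disclosure.

```python
import random
from itertools import product

# Standard genetic code (NCBI translation table 1), built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def translate(seq):
    """Translate a coding sequence whose length is a multiple of three."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


def binary_tournament(population, levels, rng):
    """Draw two distinct candidates; keep the one with the lower (better) level."""
    a, b = rng.sample(range(len(population)), 2)
    return population[a] if levels[a] <= levels[b] else population[b]


def crossover(parent1, parent2, rng):
    """One-point crossover cut on a codon boundary (needs at least two codons)."""
    cut = 3 * rng.randrange(1, len(parent1) // 3)
    return parent1[:cut] + parent2[cut:], parent2[:cut] + parent1[cut:]


def mutate(seq, rate, rng):
    """With probability `rate`, swap each codon for a random synonymous codon."""
    codons = (seq[i:i + 3] for i in range(0, len(seq), 3))
    return "".join(
        rng.choice(SYNONYMS[CODON_TABLE[c]]) if rng.random() < rate else c
        for c in codons
    )


def make_offspring(population, levels, size, rate, rng):
    """Build `size` children by selection, crossover, and mutation."""
    children = []
    while len(children) < size:
        p1 = binary_tournament(population, levels, rng)
        p2 = binary_tournament(population, levels, rng)
        children.extend(mutate(c, rate, rng) for c in crossover(p1, p2, rng))
    return children[:size]
```

Cutting on codon boundaries and mutating only between synonymous codons is one simple way to keep the encoded protein fixed; a bit-level encoding would need its own decoding step to give the same guarantee.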

In some embodiments, the initial population set and the subsequent population set are of the same size.

In some embodiments, performing optimization of the harmony index, the codon context index, and the outlier index comprises a plurality of iterations, wherein the i-th iteration of the plurality of iterations comprises: receiving a population set of nucleic acid sequences corresponding to the (i-1)-th iteration; associating each nucleic acid sequence of the population set corresponding to the (i-1)-th iteration with a non-domination level; sorting the nucleic acid sequences of the population set corresponding to the (i-1)-th iteration based on the associated non-domination levels; generating a population set corresponding to the i-th iteration, wherein the population set corresponding to the i-th iteration comprises a subset of the sorted nucleic acid sequences corresponding to the (i-1)-th iteration and an offspring population generated based on the sorted nucleic acid sequences corresponding to the (i-1)-th iteration; and determining, based on one or more termination conditions, whether to proceed to the (i+1)-th iteration using the population set corresponding to the i-th iteration.

In some embodiments, associating each nucleic acid sequence with a non-domination level comprises: for each nucleic acid sequence of the population set corresponding to the (i-1)-th iteration, calculating a respective harmony index value, a respective codon context index value, and a respective outlier index value.
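To make the notion of a non-domination level concrete, the following Python sketch assigns levels to candidates from their three index values. It assumes each candidate is scored as a tuple `(H, CC, O)` in which the harmony index and codon context index are maximized and the outlier index is minimized; the function names are hypothetical, and the simple repeated-front peeling shown here is a plain stand-in for the faster sorting used in practical NSGA implementations.

```python
def dominates(a, b):
    """True if score a = (H, CC, O) is no worse than b everywhere and better somewhere."""
    no_worse = a[0] >= b[0] and a[1] >= b[1] and a[2] <= b[2]
    better = a[0] > b[0] or a[1] > b[1] or a[2] < b[2]
    return no_worse and better


def non_domination_levels(scores):
    """Return one level per candidate; level 0 is the non-dominated front."""
    remaining = set(range(len(scores)))
    levels = [None] * len(scores)
    level = 0
    while remaining:
        # Candidates not dominated by any other remaining candidate form the next front.
        front = {
            i for i in remaining
            if not any(dominates(scores[j], scores[i]) for j in remaining if j != i)
        }
        for i in front:
            levels[i] = level
        remaining -= front
        level += 1
    return levels
```

For example, with scores `[(0.9, 0.8, 0.1), (0.8, 0.7, 0.2), (0.95, 0.5, 0.3), (0.5, 0.4, 0.9)]` the first and third candidates do not dominate each other and share level 0, while the second and fourth fall to levels 1 and 2.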

In some embodiments, generating the population set corresponding to the i-th iteration comprises associating at least one nucleic acid sequence of the sorted nucleic acid sequences corresponding to the (i-1)-th iteration with one of a plurality of predetermined reference points.
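One common way to obtain predetermined reference points in NSGA-III is the Das and Dennis construction, which spreads points evenly over the unit simplex according to a chosen number of divisions. The Python sketch below shows that construction and a nearest-reference-line association; it assumes the objective vector passed to `associate` has already been normalized to non-negative values, and both function names are hypothetical rather than taken from the disclosure.

```python
from itertools import combinations


def das_dennis(n_obj, divisions):
    """Evenly spaced reference points on the unit simplex (stars-and-bars enumeration)."""
    points = []
    slots = divisions + n_obj - 1
    for bars in combinations(range(slots), n_obj - 1):
        counts, prev = [], -1
        for b in bars:
            counts.append(b - prev - 1)   # stars between consecutive bars
            prev = b
        counts.append(slots - 1 - prev)   # stars after the last bar
        points.append(tuple(c / divisions for c in counts))
    return points


def associate(f, refs):
    """Index of the reference point whose line through the origin is closest to f."""
    def perpendicular_distance(r):
        scale = sum(a * b for a, b in zip(f, r)) / sum(x * x for x in r)
        return sum((a - scale * b) ** 2 for a, b in zip(f, r)) ** 0.5
    return min(range(len(refs)), key=lambda k: perpendicular_distance(refs[k]))
```

With three objectives and four divisions this yields 15 reference points, each summing to 1.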

In some embodiments, the one or more termination conditions comprise: a fixed number of iterations has been reached; the best fitness has been reached and no better results are being produced; a minimum criterion for a near-optimal solution is satisfied by some solutions; or any combination thereof.
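A combined check of these termination conditions could look like the following Python sketch. It assumes a hypothetical scalar summary per generation (for instance, the best harmony index seen in that generation) stored in `best_history`; the parameter names and default values are illustrative only.

```python
def should_stop(generation, best_history, max_generations=200, patience=20, target=None):
    """Return True when any of the three termination conditions holds."""
    # Condition 1: a fixed number of iterations has been reached.
    if generation >= max_generations:
        return True
    # Condition 3: some solution already meets the minimum near-optimal criterion.
    if target is not None and best_history and best_history[-1] >= target:
        return True
    # Condition 2: no improvement over the last `patience` generations.
    if len(best_history) > patience:
        recent = max(best_history[-patience:])
        earlier = max(best_history[:-patience])
        if recent <= earlier:
            return True
    return False
```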

In some embodiments, the harmony index of a candidate nucleic acid sequence is calculated based on the formula $H = 1 - D(F_{hs}, F_{ts})$, where $D()$ represents a distance function; $F_{hs}$ comprises a vector comprising the frequencies of synonymous codons of a plurality of amino acids in the plurality of highly expressed genes; and $F_{ts}$ comprises a vector comprising the frequencies of synonymous codons of the plurality of amino acids in the coding gene of the candidate nucleic acid sequence.

In some embodiments, $D()$ represents a function that measures the distance between two vectors. In some embodiments, $D()$ is a distance function including, but not limited to, the Euclidean distance, cosine distance, Manhattan distance, or Minkowski distance between two vectors.
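The harmony index formula H = 1 - D(F_hs, F_ts) can be sketched in Python as follows. Two assumptions are made for the example and should be read as such: the frequency of a codon is taken to be its share among all codons encoding the same amino acid, and cosine distance is chosen for D() because it keeps H between 0 and 1 for non-negative vectors. The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
CODONS = sorted(CODON_TABLE)


def synonymous_frequencies(seqs):
    """Vector over all 64 codons: share of each codon among its synonymous codons (assumed definition)."""
    counts = Counter()
    for seq in seqs:
        counts.update(seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3))
    per_aa = Counter()
    for codon, n in counts.items():
        per_aa[CODON_TABLE[codon]] += n
    return [
        counts[c] / per_aa[CODON_TABLE[c]] if per_aa[CODON_TABLE[c]] else 0.0
        for c in CODONS
    ]


def cosine_distance(u, v):
    """1 minus cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def harmony_index(f_hs, f_ts, distance=cosine_distance):
    """H = 1 - D(F_hs, F_ts)."""
    return 1.0 - distance(f_hs, f_ts)
```

A candidate whose synonymous codon usage matches the reference set exactly scores H = 1 under this choice of distance; any other distance function with the same two-vector signature can be passed through the `distance` argument.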

In some embodiments, the frequency of synonymous codons of the plurality of highly expressed genes or of the candidate nucleic acid sequence is defined as follows:

In some embodiments, the codon context index of a candidate nucleic acid sequence is calculated based on the formula $CC = 1 - D(F_{hcc}, F_{tcc})$, where $D()$ represents a distance function; $F_{hcc}$ comprises a vector comprising the frequencies of synonymous codon pairs of two consecutive amino acids in the plurality of highly expressed genes; and $F_{tcc}$ comprises a vector comprising the frequencies of synonymous codon pairs of two consecutive amino acids in the coding gene of the candidate nucleic acid sequence.

In some embodiments, $D()$ represents a function that measures the distance between two vectors. In some embodiments, $D()$ is a distance function including, but not limited to, the Euclidean distance, cosine distance, Manhattan distance, or Minkowski distance between two vectors.
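The codon context index CC = 1 - D(F_hcc, F_tcc) can be illustrated in the same spirit. In this Python sketch, the frequency of a codon pair is assumed to be its share among all codon pairs encoding the same pair of consecutive amino acids, and cosine distance is again used for D(); both choices, and the function names, are assumptions for the example.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_pair_frequencies(seqs):
    """Map each adjacent codon pair to its share among synonymous pairs (assumed definition)."""
    pair_counts, aa_pair_counts = Counter(), Counter()
    for seq in seqs:
        codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
        for c1, c2 in zip(codons, codons[1:]):
            pair_counts[(c1, c2)] += 1
            aa_pair_counts[(CODON_TABLE[c1], CODON_TABLE[c2])] += 1
    return {
        (c1, c2): n / aa_pair_counts[(CODON_TABLE[c1], CODON_TABLE[c2])]
        for (c1, c2), n in pair_counts.items()
    }


def codon_context_index(f_hcc, f_tcc):
    """CC = 1 - D(F_hcc, F_tcc), with D taken as cosine distance over the union of pairs."""
    keys = sorted(set(f_hcc) | set(f_tcc))
    u = [f_hcc.get(k, 0.0) for k in keys]
    v = [f_tcc.get(k, 0.0) for k in keys]
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    distance = 1.0 - dot / norm if norm else 1.0
    return 1.0 - distance
```

Storing the pair frequencies as dictionaries keeps the vectors sparse, since only a small fraction of the 4,096 possible codon pairs appears in any one short sequence.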

In some embodiments, the frequency of synonymous codon pairs of the plurality of highly expressed genes or of the candidate nucleic acid sequence is defined as follows:

In some embodiments, the outlier index is calculated based on the formula

where $N$ is the number of the plurality of predetermined sequence features; $f_i(x)$ represents a penalty scoring function of the i-th sequence feature of the plurality of predetermined sequence features; and $w_i$ represents a relative weight associated with $f_i(x)$.
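As a rough illustration of how N penalty scoring functions f_i(x) and relative weights w_i might be combined, the Python sketch below uses a plain weighted sum. The weighted-sum form is an assumption made for this example, since the formula itself is not reproduced in the text; the two penalty functions, their thresholds, and the motif list are likewise hypothetical.

```python
def gc_penalty(seq, low=0.3, high=0.7):
    """Penalty grows with the distance of the GC fraction from an assumed [low, high] window."""
    if not seq:
        return 0.0
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    return max(0.0, low - gc, gc - high)


def motif_penalty(seq, motifs=("AATAAA", "ATTTA")):
    """Count occurrences of illustrative unfavorable motifs (non-overlapping matches)."""
    return float(sum(seq.count(m) for m in motifs))


def outlier_index(seq, features):
    """Assumed form: sum of w_i * f_i(seq) over (f_i, w_i) pairs."""
    return sum(weight * penalty(seq) for penalty, weight in features)
```

For instance, `outlier_index("ATGAATAAAGGC", [(gc_penalty, 1.0), (motif_penalty, 2.0)])` evaluates to 2.0 under these assumptions, because the GC fraction lies inside the window and one listed motif is present.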

In some embodiments, the plurality of predetermined features comprises a GC content value, cis elements, repeat elements, RNA splicing sites, ribosome binding sequences, minimum free energy of the mRNA, or any combination thereof.

In some embodiments, the plurality of predetermined features is identified based on a selected expression system.

In some embodiments, the variant of the NSGA-III algorithm comprises the EliteNSGA-III algorithm or an NSGA-II based immune algorithm.

In some embodiments, performing optimization of the harmony index, the codon context index, and the outlier index comprises: ranking the plurality of optimized nucleic acid sequences by descending harmony index, then by descending codon context index, and then by ascending outlier index; and selecting one or more top-ranked optimized nucleic acid sequences for synthesis.
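The three-key ranking described above maps naturally onto a single sort with a composite key. In this Python sketch each candidate is a dictionary with hypothetical keys `"H"`, `"CC"`, and `"O"` holding its harmony index, codon context index, and outlier index.

```python
def rank_candidates(candidates):
    """Sort by harmony index (desc), then codon context index (desc), then outlier index (asc)."""
    return sorted(candidates, key=lambda c: (-c["H"], -c["CC"], c["O"]))


def select_for_synthesis(candidates, top_n=1):
    """Keep the top-ranked candidates."""
    return rank_candidates(candidates)[:top_n]
```

Because the keys are compared lexicographically, the codon context index only breaks ties in the harmony index, and the outlier index only breaks ties in both.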

In some embodiments, the method further comprises c) removing a predetermined unfavorable subsequence or motif from an optimized nucleic acid sequence of the plurality of optimized nucleic acid sequences.

In some embodiments, the predetermined unfavorable subsequence or motif is identified based on analysis of a plurality of text portions.

In some embodiments, removing the predetermined unfavorable subsequence or motif comprises: identifying the predetermined unfavorable subsequence or motif in the optimized nucleic acid sequence; identifying a plurality of synonymous codons based on the identified predetermined unfavorable subsequence or motif; and selecting, from the plurality of synonymous codons, a synonymous codon to substitute into the identified predetermined unfavorable subsequence or motif in the optimized nucleic acid sequence.
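The three steps of motif removal (locate the motif, collect synonymous codons for the codons it overlaps, substitute one of them) might be sketched in Python as follows. The greedy choice of the first substitution that lowers the motif count is an assumption of this example, and `remove_motif` is a hypothetical name.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def translate(seq):
    """Translate a coding sequence whose length is a multiple of three."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


def remove_motif(seq, motif):
    """Remove occurrences of `motif` by synonymous codon substitution where possible."""
    while motif in seq:
        pos = seq.find(motif)
        first, last = pos // 3, (pos + len(motif) - 1) // 3  # codons overlapping the motif
        replacement = None
        for idx in range(first, last + 1):
            codon = seq[3 * idx:3 * idx + 3]
            for alt in SYNONYMS.get(CODON_TABLE.get(codon), []):
                trial = seq[:3 * idx] + alt + seq[3 * idx + 3:]
                if trial.count(motif) < seq.count(motif):
                    replacement = trial
                    break
            if replacement:
                break
        if replacement is None:
            return seq  # no synonymous substitution removes this occurrence
        seq = replacement
    return seq
```

Each accepted substitution strictly lowers the motif count, so the loop terminates, and because only synonymous codons are swapped in, the encoded protein is unchanged.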

In some embodiments, at least one of the harmony index, the codon context index, and the outlier index is calculated based on one or more characteristics of a plurality of highly expressed genes from one or more databases.

In some embodiments, the one or more characteristics comprise codon frequencies, synonymous codon frequencies, codon pair frequencies, or any combination thereof.

In some embodiments, the method further comprises setting one or more parameters, wherein the one or more parameters comprise the size of the population set, the number of divisions, the distribution index for simulated binary crossover, the crossover rate for simulated binary crossover, the mutation rate for bit-flip mutation, the distribution index for bit-flip mutation, or any combination thereof.

In some embodiments, provided is a non-transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions which, when executed by one or more processors of an electronic device, cause the electronic device to perform any of the methods described herein.

In some embodiments, provided is a system for optimizing a nucleic acid sequence for expression of a protein in a host, the system comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods described herein.

In some embodiments, provided is an electronic device for optimizing a nucleic acid sequence for expression of a protein in a host, the device comprising means for performing any of the methods described herein.

In some embodiments, provided is a program product stored on a recordable medium for optimizing a nucleic acid sequence for expression of a protein in a host, the program product comprising computer software for performing any of the methods described herein.

In some embodiments, provided is an isolated nucleic acid molecule comprising an optimized nucleic acid sequence obtained by any of the methods described herein.

In some embodiments, provided is a vector comprising the above-mentioned isolated nucleic acid molecule.

In some embodiments, provided is a recombinant host cell comprising the above-mentioned isolated nucleic acid molecule or the above-mentioned vector.

In some embodiments, provided is a method for expressing a protein in a host cell, the method comprising: (a) obtaining an optimized nucleic acid sequence for expression of the protein in the host cell using any of the methods described herein; (b) synthesizing a nucleic acid molecule comprising the optimized nucleic acid sequence; (c) introducing the nucleic acid molecule into the host cell to obtain a recombinant host cell; and (d) culturing the recombinant host cell under conditions that allow expression of the protein from the optimized nucleic acid sequence.

FIG. 1 shows a block diagram of an exemplary process for codon optimization, in accordance with some embodiments.
FIG. 2A shows an exemplary pipeline for building and executing an algorithm for optimizing a sequence (e.g., a nucleic acid sequence) for expression of a protein in a host, in accordance with some embodiments.
FIG. 2B shows an exemplary general workflow of a genetic algorithm, in accordance with some embodiments.
FIG. 3 shows Western blot results for optimized GFP and JNK3A1 compared with their wild types, in accordance with some embodiments.
FIG. 4 shows an exemplary electronic device, in accordance with some embodiments.

The present invention provides improved codon optimization for improving the recombinant expression of genes in a variety of hosts, including but not limited to E. coli, CHO, HEK293, yeast, insect, and cell-free expression systems. An exemplary system according to the invention collects highly expressed genes for an expression system, extracts basic sequence features, replicates beneficial global patterns in a sequence of interest (e.g., a nucleic acid sequence), and removes unfavorable features in order to improve expression of the target gene in the expression system.

Currently, a number of codon optimization tools have been developed; they are summarized in Table 1 below. To boost expression during codon optimization for bacterial, yeast, insect, and mammalian cells, these tools have considered many, preferably most or all, of the following parameters and factors: codon usage (e.g., codon adaptation index [CAI], effective number of codons [ENc], relative synonymous codon usage [RSCU], and synonymous codon usage order [SCUO]), codon pairs, tRNA usage (e.g., tRNA adaptation index [tAI]), GC content, ribosome binding sites (RBS), hidden stop codons, motif avoidance, restriction site removal, mRNA secondary structure of the gene (e.g., mRNA free energy), and hydropathy index optimization.

Table 1

However, because so many factors may be considered, balancing them is a multi-objective optimization problem, and a difficult one because the objectives can conflict with one another. On the other hand, omitting one or more factors or parameters from consideration can result in low or no expression of the target gene in the expression system.

Accordingly, in one aspect the invention provides methods of sequence optimization for improved recombinant protein expression that use the NSGA-III algorithm or a variant thereof to optimize multiple (e.g., more than two) objectives. In another aspect, provided are methods of removing unfavorable motifs and features from a nucleic acid sequence before gene synthesis and protein expression (e.g., after the iterations of the NSGA-III algorithm are completed). Also provided are methods of quantifying and calculating the multiple objectives in the optimization algorithm, as well as methods of identifying the unfavorable motifs and features to be reduced or removed.

Also provided are systems, non-transitory computer-readable storage media storing one or more programs, electronic devices, and program products for performing any one or more steps of any of the methods described herein. Also provided are an isolated nucleic acid molecule comprising an optimized nucleic acid sequence obtained by the methods described herein; a vector comprising the isolated nucleic acid molecule; and a recombinant host cell comprising the isolated nucleic acid molecule or the vector. Also provided are methods of expressing a protein in a host cell that comprise any of the methods described herein.

It is understood that embodiments of the invention described herein include "consisting of" and/or "consisting essentially of" embodiments.

Reference to "about" a value or parameter herein includes (and describes) variations that are directed to that value or parameter per se. For example, a description referring to "about X" includes a description of "X".

As used herein, reference to "not" a value or parameter generally means and describes "other than" that value or parameter. For example, a statement that the method is not used to treat cancer of type X means that the method is used to treat cancer of types other than X.

As used herein and in the appended claims, the singular forms "a", "an", and "the" include plural referents unless the context clearly dictates otherwise.

As used herein and in the appended claims, a "set" refers to one or more items unless the context clearly dictates otherwise.

Methods of codon optimization

In one aspect, the invention provides methods (e.g., computer-implemented or computer-assisted methods) of optimizing a nucleic acid sequence for expression of a protein in a host. Related to these methods are methods of removing unfavorable motifs and features from a nucleic acid sequence before gene synthesis and protein expression (e.g., after the iterations of the NSGA-III algorithm are completed). Also related to these methods are methods of quantifying and calculating the multiple objectives in the optimization algorithm, as well as methods of identifying the unfavorable motifs and features to be reduced or removed.

FIG. 1 illustrates an exemplary process 100 for codon optimization, in which dashed blocks denote optional steps. Although portions of process 100 are described herein as being performed by particular devices, it will be appreciated that process 100 is not so limited. In other examples, process 100 is performed using only a single electronic device (e.g., electronic device 400) or using multiple electronic devices. In process 100, some blocks are optionally combined, the order of some blocks is optionally changed, and some blocks are optionally omitted. In some examples, additional steps may be performed in combination with process 100.

At block 106, an electronic device receives an initial population set, wherein the initial population set comprises a plurality of initial candidate nucleic acid sequences capable of expressing the protein. In some embodiments, the initial population set is randomly generated. In some embodiments, the initial population set is of a predetermined size (e.g., determined by a user).

In some embodiments, as depicted in block 106, receiving the initial population set comprises generating the initial population set based on a protein sequence. For example, receiving the initial population set may comprise receiving a protein sequence (e.g., as an input from a user) and generating the initial population set based on the received protein sequence. As another example, receiving the initial population set may comprise receiving a nucleic acid sequence (e.g., as an input from a user), translating the received nucleic acid sequence into a protein sequence, and generating the initial population set based on the protein sequence.
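The two input paths described above (protein in, or nucleic acid in and translated first) can be sketched as follows. This is a minimal illustration, assuming the standard genetic code and uniform random choice among synonymous codons; the function names `translate` and `random_population` are hypothetical and do not come from the specification.

```python
import random
from collections import defaultdict

# Standard genetic code, built in TCAG order (an assumption; hosts with
# alternative codes would need a different table).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TO_AA = dict(zip(CODONS, AMINO))
AA_TO_CODONS = defaultdict(list)
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODONS[aa].append(codon)


def translate(dna):
    """Translate a coding sequence codon by codon (trailing bases ignored)."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def random_population(protein, size, rng=None):
    """Build `size` random coding sequences that all encode `protein`."""
    rng = rng or random.Random()
    return [
        "".join(rng.choice(AA_TO_CODONS[aa]) for aa in protein)
        for _ in range(size)
    ]


# Path 1: the user supplies a protein sequence.
population = random_population("MKWF", size=5, rng=random.Random(0))
# Path 2: the user supplies a nucleic acid sequence, which is translated first.
population_from_dna = random_population(translate("ATGAAATGGTTT"), size=5)
```

Every member of either population encodes the same protein and differs only in its synonymous codon choices, which is the search space the optimization explores.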

In some embodiments, the initial population set comprises binary representations (e.g., binary strings) of the plurality of initial candidate nucleic acid sequences. In general, a binary string, rather than a codon list/array/vector, is chosen as the data structure representing a coding gene, and all operations of the genetic algorithm, including population initialization, crossover/recombination, mutation, and selection, act on binary strings; the only exception is the fitness evaluation of the genes before selection. As described further below, in some embodiments, when the fitness functions (i.e., the three index functions) need to be evaluated for each individual of the whole population before selection, the binary representation must be temporarily converted back into a codon string.
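The specification does not state how codons are mapped to bits, so the following is only one possible encoding, offered as a sketch. It assumes a fixed 3 bits per codon position (enough for the largest synonymous family of six codons) and interprets each 3-bit value modulo the number of synonymous codons, so that any bit string produced by crossover or mutation still decodes to the same protein. `BITS`, `encode`, and `decode` are hypothetical names.

```python
from collections import defaultdict

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TO_AA = dict(zip(CODONS, AMINO))
AA_TO_CODONS = defaultdict(list)
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODONS[aa].append(codon)

BITS = 3  # assumption: 2**3 = 8 covers the largest family (6 codons)


def encode(dna, protein):
    """Encode each codon as the index of that codon among its synonyms."""
    chunks = []
    for pos, aa in enumerate(protein):
        options = sorted(AA_TO_CODONS[aa])
        index = options.index(dna[3 * pos:3 * pos + 3])
        chunks.append(format(index, "0{}b".format(BITS)))
    return "".join(chunks)


def decode(bits, protein):
    """Convert a binary string back into a codon string for fitness evaluation."""
    codons = []
    for pos, aa in enumerate(protein):
        options = sorted(AA_TO_CODONS[aa])
        value = int(bits[pos * BITS:(pos + 1) * BITS], 2)
        codons.append(options[value % len(options)])  # wrap unused values
    return "".join(codons)


bits = encode("ATGTTCCTG", "MFL")
restored = decode(bits, "MFL")
```

A consequence of the modulo wrap is that some codons are reachable from more than one bit pattern, which slightly biases random mutation toward them; an implementation that cares about this could use a different mapping.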

At block 108, the electronic device performs optimization of a harmony index, a codon context index, and an outlier index using a computer-assisted NSGA-III algorithm or a variant thereof, based on the initial population set, thereby obtaining a plurality of optimized nucleic acid sequences capable of expressing the protein.

In all or some embodiments, the harmony index of a candidate nucleic acid sequence indicates the consistency of the usage frequency distribution of synonymous codons between a plurality of highly expressed genes and the candidate nucleic acid sequence (i.e., the gene encoding the candidate protein during optimization), which helps to address how to assign the counts of the synonymous codons of a particular amino acid. The codon context index of a candidate nucleic acid sequence is a measure for placing synonymous codons at suitable positions. The outlier index of a candidate nucleic acid sequence is a measure of the negative effects of a plurality of predetermined sequence features on the candidate nucleic acid sequence.

In some embodiments, as depicted in block 108, performing the optimization of the harmony index, the codon context index, and the outlier index comprises: maximizing the harmony index; maximizing the codon context index; and minimizing the outlier index.

Optimization can be performed using a multi-objective genetic algorithm with three objectives: maximizing the harmony index, maximizing the codon context index, and minimizing the outlier index. In some embodiments, the NSGA-III algorithm or a variant thereof is used. Unlike traditional genetic algorithms, NSGA-III maintains diversity among population members by supplying and adaptively updating a number of well-spread predefined reference points, so its selection operator differs significantly. In addition, NSGA-III has demonstrated its efficacy on three- to fifteen-objective optimization problems relative to other genetic algorithms such as NSGA-II. Variants of the NSGA-III algorithm include the EliteNSGA-III algorithm, NSGA-II based immune algorithms, MAM-MOIA, and MOIA. The EliteNSGA-III algorithm is described in the publication entitled "ELITENSGA-III: AN IMPROVED EVOLUTIONARY MANY-OBJECTIVE OPTIMIZATION ALGORITHM" published in 2016 by Amin Ibrahim et al., which is incorporated herein by reference in its entirety. Various immune algorithms are described in, for example, the publication entitled "MOIA: MULTI-OBJECTIVE IMMUNE ALGORITHM" published in September 2010 by Guan-Chun Luh et al., the publication entitled "OVERVIEW OF ARTIFICIAL IMMUNE SYSTEMS FOR MULTI-OBJECTIVE OPTIMIZATION" published in 2007 by Felipe Campelo et al., the publication entitled "A MULTIOBJECTIVE IMMUNE ALGORITHM BASED ON A MULTIPLE-AFFINITY MODEL" published in April 2010 by Zhi-Hua Hu, and Chinese Patent Application No. 201710611752.5 filed on July 25, 2017, each of which is incorporated herein by reference in its entirety.

In accordance with the operation of the NSGA-III algorithm (or a similar genetic algorithm), performing the optimization of the harmony index, the codon context index, and the outlier index comprises: for each initial candidate nucleic acid sequence of the initial population set, calculating a respective harmony index value, a respective codon context index value, and a respective outlier index value; assigning, based on the calculating, a plurality of fitness values corresponding to the plurality of initial candidate nucleic acid sequences; sorting the plurality of initial candidate nucleic acid sequences based on the plurality of fitness values; and including a subset of the sorted plurality of initial candidate nucleic acid sequences in a subsequent population set (i.e., the set used in the second iteration).

In accordance with the operation of the NSGA-III algorithm (or a similar genetic algorithm), the method comprises generating an offspring population based on the initial population, and including the offspring population in the subsequent population set (i.e., the set used in the second iteration). In some embodiments, the offspring population is generated via binary tournament selection, crossover/recombination, mutation, or any combination thereof.
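Offspring generation over binary strings can be sketched as below. This is an illustration under stated assumptions, not the specification's operators: the tournament compares a precomputed `rank` (lower is better, for instance a non-domination level), and a plain one-point crossover is substituted for whatever crossover operator an implementation actually uses. All function names and default rates are hypothetical.

```python
import random


def binary_tournament(population, rank, rng):
    """Draw two individuals at random and keep the better-ranked one."""
    i = rng.randrange(len(population))
    j = rng.randrange(len(population))
    return population[i] if rank[i] <= rank[j] else population[j]


def one_point_crossover(a, b, rng):
    """Swap the tails of two equal-length bit strings at a random cut point."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def bit_flip(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return "".join(
        ("1" if c == "0" else "0") if rng.random() < rate else c for c in bits
    )


def make_offspring(population, rank, size,
                   crossover_rate=0.9, mutation_rate=0.01, rng=None):
    """Produce `size` children by selection, crossover, and mutation."""
    rng = rng or random.Random()
    children = []
    while len(children) < size:
        p1 = binary_tournament(population, rank, rng)
        p2 = binary_tournament(population, rank, rng)
        if len(p1) > 1 and rng.random() < crossover_rate:
            p1, p2 = one_point_crossover(p1, p2, rng)
        children.extend(bit_flip(c, mutation_rate, rng) for c in (p1, p2))
    return children[:size]


parents = ["0000", "1111", "0101"]
kids = make_offspring(parents, rank=[0, 1, 2], size=5, rng=random.Random(1))
```

Because the operators act on bit strings only, they never need to know the genetic code; decoding back to codons is required only when the three indices are evaluated.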

In some embodiments, the initial population set and the subsequent population set (i.e., the set used in the second iteration) are of the same size.

In accordance with the operation of the NSGA-III algorithm (or a similar genetic algorithm), performing the optimization of the harmony index, the codon context index, and the outlier index comprises a plurality of iterations. The i-th iteration of the plurality of iterations (where i can be 2, 3, 4, 5, 6, ..., n) comprises: receiving a population set of nucleic acid sequences corresponding to the (i-1)-th iteration; associating each nucleic acid sequence of the population set corresponding to the (i-1)-th iteration with a non-domination level; sorting the nucleic acid sequences of the population set corresponding to the (i-1)-th iteration based on the associated non-domination levels; generating a population set corresponding to the i-th iteration, wherein the population set corresponding to the i-th iteration comprises a subset of the sorted nucleic acid sequences corresponding to the (i-1)-th iteration and an offspring population generated based on the sorted nucleic acid sequences corresponding to the (i-1)-th iteration; and determining, based on one or more termination conditions, whether to proceed to the (i+1)-th iteration using the population set corresponding to the i-th iteration.

In some embodiments, associating each nucleic acid sequence with a non-domination level comprises calculating, for each nucleic acid sequence of the population set corresponding to the (i-1)-th iteration, a respective harmony index value, a respective codon context index value, and a respective outlier index value.
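Assigning non-domination levels from the three index values can be sketched as follows. The sketch assumes each individual is already scored as a `(harmony, codon_context, outlier)` triple, converts the two maximized indices to minimization by negation, and peels off successive Pareto fronts in the simplest possible way. It is not the bookkeeping-optimized sort used in production NSGA implementations, and the function names are hypothetical.

```python
def as_minimization(harmony, codon_context, outlier):
    """Negate the maximized indices so that smaller is better everywhere."""
    return (-harmony, -codon_context, outlier)


def dominates(a, b):
    """True if `a` is no worse than `b` in every objective and better in one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def non_domination_levels(objectives):
    """Return a level per individual: 0 for the first front, 1 for the next, ..."""
    remaining = set(range(len(objectives)))
    levels = [None] * len(objectives)
    level = 0
    while remaining:
        front = {
            i for i in remaining
            if not any(dominates(objectives[j], objectives[i])
                       for j in remaining if j != i)
        }
        for i in front:
            levels[i] = level
        remaining -= front
        level += 1
    return levels


# Illustrative (made-up) index values for four candidate sequences.
scores = [(0.90, 0.80, 0.1), (0.50, 0.50, 0.5),
          (0.95, 0.70, 0.2), (0.40, 0.40, 0.9)]
levels = non_domination_levels([as_minimization(*s) for s in scores])
```

In this example the first and third candidates trade harmony against codon context, so neither dominates the other and both land in level 0.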

In accordance with the operation of the NSGA-III algorithm, in some embodiments, generating the population set corresponding to the i-th iteration comprises associating at least one nucleic acid sequence of the sorted nucleic acid sequences corresponding to the (i-1)-th iteration with one of a plurality of predetermined reference points.
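Reference-point association can be sketched as follows. The sketch assumes the reference points are laid out on the unit simplex with a fixed number of divisions per objective (the usual structured layout for NSGA-III) and that the objective vectors have already been normalized to non-negative values; the normalization step itself is omitted. The function names are hypothetical.

```python
import math
from itertools import combinations


def simplex_reference_points(n_objectives, divisions):
    """Enumerate all points with coordinates k/divisions that sum to 1."""
    points = []
    slots = divisions + n_objectives - 1
    # Stars and bars: choose where the (n_objectives - 1) separators go.
    for bars in combinations(range(slots), n_objectives - 1):
        counts, previous = [], -1
        for bar in bars:
            counts.append(bar - previous - 1)
            previous = bar
        counts.append(slots - 1 - previous)
        points.append(tuple(c / divisions for c in counts))
    return points


def perpendicular_distance(point, direction):
    """Distance from `point` to the line through the origin along `direction`."""
    norm_sq = sum(d * d for d in direction)
    scale = sum(p * d for p, d in zip(point, direction)) / norm_sq
    return math.sqrt(sum((p - scale * d) ** 2 for p, d in zip(point, direction)))


def associate(point, reference_points):
    """Index of the reference point whose reference line is closest to `point`."""
    return min(
        range(len(reference_points)),
        key=lambda k: perpendicular_distance(point, reference_points[k]),
    )


refs = simplex_reference_points(n_objectives=3, divisions=4)
nearest = associate((0.5, 0.0, 0.0), refs)
```

With three objectives and four divisions this yields 15 reference points; selection then favors individuals attached to sparsely populated reference points, which is how diversity is preserved.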

In some embodiments, the method further comprises setting one or more parameters for the optimization algorithm, wherein the one or more parameters comprise a size of the population set, a number of divisions, a distribution index for simulated binary crossover, a crossover rate for simulated binary crossover, a mutation rate for bit-flip mutation, a distribution index for bit-flip mutation, or any combination thereof.

In some embodiments, at least one of the harmony index, the codon context index, and the outlier index is calculated during optimization based on one or more characteristics of a plurality of highly expressed genes from one or more databases. In some embodiments, the one or more characteristics comprise codon frequencies, synonymous codon frequencies, codon pair frequencies, or a combination thereof. These characteristics of the highly expressed genes can be used to calculate the harmony index, the codon context index, and the outlier index for a given candidate nucleic acid sequence, as shown in the formulas below.

In some embodiments, as depicted in block 102, these characteristics of the highly expressed genes are identified based on private or public databases. For example, the database(s) may be a proprietary database containing previously and successfully optimized orders collected from a company's ordering system. As another example, the data may be obtained by data mining of RNA-seq data under various culture conditions, which may be public information. Data processing is performed to obtain basic information about the highly expressed genes, including codon frequencies, synonymous codon frequencies, and codon pair frequencies.

In some embodiments, the harmony index of the candidate nucleic acid sequence is calculated based on the formula $H = 1 - D(F_{hs}, F_{ts})$, where $D()$ denotes a distance function; $F_{hs}$ is a vector comprising the frequencies of the synonymous codons of a plurality of amino acids in the plurality of highly expressed genes; and $F_{ts}$ is a vector comprising the frequencies of the synonymous codons of the plurality of amino acids in the coding gene of the candidate nucleic acid sequence.

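In some embodiments, the outlier index is calculated based on the formula $\sum_{i=1}^{N} w_i\, f_i(x)$, where $N$ is the number of predetermined sequence features, $f_i(x)$ is the penalty scoring function of the $i$-th sequence feature, and $w_i$ is the relative weight given to $f_i(x)$.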

In some embodiments, the plurality of predetermined features is identified based on a selected expression system. The list of unfavorable factors can vary between expression systems, and their effects or weights are likewise not equal.

At block 110, the method further comprises c) removing a predetermined unfavorable subsequence or motif from an optimized nucleic acid sequence of the plurality of optimized nucleic acid sequences. In some embodiments, removing the predetermined unfavorable subsequence or motif comprises: identifying the predetermined unfavorable subsequence or motif in the optimized nucleic acid sequence; identifying a plurality of synonymous codons based on the identified predetermined unfavorable subsequence or motif; and selecting, from the plurality of synonymous codons, a synonymous codon to substitute into the identified predetermined unfavorable subsequence in the optimized nucleic acid sequence.
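The identify, enumerate, and substitute steps of block 110 can be sketched as follows. This is a minimal illustration assuming the standard genetic code and a greedy policy that accepts the first synonymous substitution that lowers the number of motif sites; a real implementation would presumably also weigh the effect of each substitution on the three indices. The function names and the EcoRI example site are illustrative choices, not taken from the specification.

```python
from collections import defaultdict

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TO_AA = dict(zip(CODONS, AMINO))
AA_TO_CODONS = defaultdict(list)
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODONS[aa].append(codon)


def translate(dna):
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def count_sites(dna, motif):
    """Count motif occurrences, including overlapping ones."""
    return sum(dna.startswith(motif, i) for i in range(len(dna) - len(motif) + 1))


def synonymous_fix(dna, motif, pos):
    """Try synonymous codons overlapping the site at `pos`; None if none helps."""
    before = count_sites(dna, motif)
    first, last = pos // 3, (pos + len(motif) - 1) // 3
    for idx in range(first, last + 1):
        codon = dna[3 * idx:3 * idx + 3]
        for alt in AA_TO_CODONS[CODON_TO_AA[codon]]:
            candidate = dna[:3 * idx] + alt + dna[3 * idx + 3:]
            if alt != codon and count_sites(candidate, motif) < before:
                return candidate
    return None


def remove_motif(dna, motif):
    """Remove every occurrence of `motif` without changing the protein."""
    pos = dna.find(motif)
    while pos != -1:
        fixed = synonymous_fix(dna, motif, pos)
        if fixed is None:
            raise ValueError("no synonymous substitution removes the motif")
        dna = fixed
        pos = dna.find(motif)
    return dna


original = "ATGGAATTCAAA"            # encodes MEFK and contains GAATTC
cleaned = remove_motif(original, "GAATTC")
```

Each accepted substitution strictly lowers the site count, so the loop terminates; the error path covers motifs that are forced by the amino acid sequence itself (for example, a motif lying entirely within codons that have no synonyms).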

In some embodiments, as depicted in block 104, the predetermined unfavorable subsequence or motif is identified based on an analysis of a plurality of text portions (e.g., automated text mining or manual review of the literature).

In some embodiments, the method further comprises providing an output indicative of one or more optimized nucleic acid sequences of the plurality of optimized nucleic acid sequences.

In some embodiments, a system for optimizing a nucleic acid sequence for expression of a protein in a host is provided, the system comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods described herein.

In some embodiments, a method for expressing a protein in a host cell is provided, the method comprising: (a) obtaining an optimized nucleic acid sequence for expression of the protein in the host cell using any of the methods described herein; (b) synthesizing a nucleic acid molecule comprising the optimized nucleic acid sequence; (c) introducing the nucleic acid molecule into the host cell to obtain a recombinant host cell; and (d) culturing the recombinant host cell under conditions that allow expression of the protein from the optimized nucleic acid sequence.

FIG. 2A illustrates an exemplary pipeline 200 for building and running an algorithm for optimizing a sequence (e.g., a nucleic acid sequence) for expression of a protein in a host, in accordance with some embodiments of the present invention. Process 200 is performed, for example, using one or more electronic devices as shown in FIG. 4. In some examples, process 200 is performed using a client-server system, and the blocks of process 200 are divided up in any manner between the server and a client device. In other examples, the blocks of process 200 are divided up between the server and/or multiple client devices. Thus, although portions of process 200 are described herein as being performed by particular devices, it will be appreciated that process 200 is not so limited. In other examples, process 200 is performed using only a single electronic device (e.g., electronic device 400) or using multiple electronic devices. In process 200, some blocks are optionally combined, the order of some blocks is optionally changed, and some blocks are optionally omitted. In some examples, additional steps may be performed in combination with process 200.

Data collection and literature review

Referring to FIG. 2A, at block 202, a plurality of highly expressed genes may be identified from one or more databases. The databases may be public or private. For example, the database(s) may be a proprietary database containing previously and successfully optimized orders collected from a company's ordering system. As another example, the data may be obtained by data mining of RNA-seq data under various culture conditions, which may be public information.

At block 204, basic characteristics of the highly expressed genes are identified. In an exemplary embodiment, mRNA-seq experiments and data analysis are performed according to Illumina's recommended mRNA-Seq workflow for standard samples. In this workflow, the TruSeq Stranded mRNA Library Prep Kit may be used for library preparation, and PE300 on NextSeq may be used for sequencing. Subsequently, data processing with TopHat, Cufflinks, and in-house scripts may be applied to obtain basic information about the highly expressed genes, including codon frequencies, synonymous codon frequencies, and codon pair frequencies.

At blocks 206 and 208, the exemplary system can also identify any reported and validated unfavorable features to be avoided in order to preserve the established advantages. To discover negative factors that can reduce protein expression, the system can perform a literature review. For example, reported expression-related unfavorable motifs and mRNA features can be identified for various hosts by automated text mining and/or manual review.

Key factors/fitness functions for the optimization algorithm

Expression of a coding gene involves multiple steps that depend on the levels of transcription, mRNA turnover, translation (including initiation, promoter escape, elongation, and termination), and post-translational modification. Nevertheless, codon optimization can be simplified to a combinatorial problem and grouped into three intuitive operations: (i) how to initially assign the counts of the synonymous codons of a particular amino acid, (ii) how to place the synonymous codons at their most suitable positions, and (iii) how to reduce unfavorable but accidentally generated subsequences and/or motifs.

In accordance with some embodiments of the present invention, three key factors are provided below, each matching one of the three above-mentioned operations and highly correlated with protein expression: the harmony index, the codon context index, and the outlier index. As discussed below, these three indices are calculated based on the above-mentioned basic data collected from various data sources.

Referring to FIG. 2A, at block 210, an optimization procedure comprising two steps 212 and 214 is performed. In step 1, depicted in block 212, the system performs multi-objective codon optimization based on the NSGA-III algorithm or a variant thereof, which comprises maximizing the harmony index, maximizing the codon context index, and minimizing the outlier index.

1. Harmony index

The harmony index indicates the consistency of the usage frequency distribution of synonymous codons between the highly expressed genes and the candidate nucleic acid sequence. The candidate nucleic acid sequence refers to a gene encoding the candidate protein that is evaluated in at least one iteration of the optimization algorithm, which is described in detail under the heading "Multi-Objective Optimization Algorithm". In some embodiments, the harmony index is defined as follows:

$$H = 1 - D(F_{hs}, F_{ts})$$

In the above formula, H is the harmony index, and D() is a distance function between two vectors, which may be, but is not limited to, the Euclidean distance, cosine distance, Manhattan distance, or Minkowski distance. F_hs is a vector comprising the frequencies of the synonymous codons of 18 amino acids (excluding Met/M and Trp/W) in the highly expressed genes; it contains 59 elements because the three stop codons (i.e., TAA, TAG, and TGA), the codon of amino acid Met/M (i.e., ATG), and the codon of amino acid Trp/W (i.e., TGG) are removed from the 64 codons. F_ts is a vector comprising the frequencies of the synonymous codons of the 18 amino acids in the coding gene of the candidate protein awaiting codon optimization (i.e., the candidate nucleic acid sequence).
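The 59-element vectors and the resulting index can be sketched as follows. Two assumptions are made that the text does not fix: each codon's frequency is normalized within its own amino acid (so the frequencies of one amino acid's synonymous codons sum to 1), and the cosine distance is chosen from the listed options because it keeps H within [0, 1] for non-negative vectors. All function names are hypothetical.

```python
import math
from collections import Counter

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TO_AA = dict(zip(CODONS, AMINO))

# 64 codons minus 3 stops, ATG (Met) and TGG (Trp) leaves 59 entries.
VECTOR_CODONS = sorted(c for c, aa in CODON_TO_AA.items() if aa not in "*MW")


def synonymous_frequencies(sequences):
    """59-element vector of codon frequencies, normalized per amino acid."""
    counts = Counter()
    for dna in sequences:
        counts.update(dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    totals = Counter()
    for codon in VECTOR_CODONS:
        totals[CODON_TO_AA[codon]] += counts[codon]
    freqs = []
    for codon in VECTOR_CODONS:
        total = totals[CODON_TO_AA[codon]]
        freqs.append(counts[codon] / total if total else 0.0)
    return freqs


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def harmony_index(highly_expressed_genes, candidate):
    """H = 1 - D(F_hs, F_ts) with D taken to be the cosine distance."""
    f_hs = synonymous_frequencies(highly_expressed_genes)
    f_ts = synonymous_frequencies([candidate])
    return 1.0 - cosine_distance(f_hs, f_ts)


same = harmony_index(["TTTTTCAAA"], "TTTTTCAAA")
opposite = harmony_index(["TTTTTT"], "TTCTTC")
```

A candidate that mirrors the reference usage scores close to 1, while one that uses only codons absent from the reference scores 0. With an unbounded distance such as the raw Euclidean distance, H could become negative, so an implementation using it would likely rescale.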

Compared with the codon adaptation index (CAI), the harmony index focuses on the distribution of synonymous codons (i.e., usage balance/charging balance) and does not always aim for the maximal CAI by exclusively selecting the single most frequently occurring synonymous codon.

In some embodiments, the frequency of a particular synonymous codon in the highly expressed genes or in the candidate nucleic acid sequence, as used in calculating the harmony index, is defined as follows:

Although the harmony index takes codon usage into account, it is concerned only with the frequency distribution of synonymous codons; the assignment of those codons to the different loci of any one of the 18 amino acids (i.e., the ordering of the synonymous codons of the same amino acid) remains an open problem. The codon context index described below is therefore needed to resolve this bottleneck through synonymous codon pairing, so as to select an approximately optimal ordering of the synonymous codons.

2. Codon context index

The codon context index of a candidate nucleic acid sequence is a measure for placing synonymous codons at suitable positions. In some embodiments, the codon context index is defined as follows:

$$CC = 1 - D(F_{hcc}, F_{tcc})$$

In the above formula, CC denotes the codon context index, and D() is a distance function between two vectors, which may be, but is not limited to, the Euclidean distance, cosine distance, Manhattan distance, or Minkowski distance. F_hcc is a vector comprising the frequencies of the synonymous codon pairs of every kind of two consecutive amino acids in the highly expressed genes. For example, amino acid Phe/F has two synonymous codons, TTT and TTC, and amino acid Lys/K likewise has AAA and AAG as codons; their synonymous codon pairs are therefore the 2 x 2 combinations TTTAAA, TTTAAG, TTCAAA, and TTCAAG. Because no synonymous codon pairs exist for the permutations of the two amino acids methionine/M and tryptophan/W (i.e., MM, MW, WW, and WM), the length of F_hcc is 61 x 61 minus 4, that is, 3717. F_tcc is a vector comprising the frequencies of the synonymous codon pairs of every kind of two consecutive amino acids in the coding gene of the candidate protein (i.e., the candidate nucleic acid sequence), and its length is also 3717.
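The 3717-element codon pair vectors can be sketched as follows. As with the harmony index sketch, two assumptions are made here that the text does not fix: each pair's frequency is normalized within its amino acid pair (so the four F-K pairs in the example above sum to 1), and D is taken to be the cosine distance so that the index is 1 minus that distance. Function names are hypothetical.

```python
import math
from collections import Counter

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TO_AA = dict(zip(CODONS, AMINO))

SENSE = sorted(c for c, aa in CODON_TO_AA.items() if aa != "*")  # 61 codons
# Drop the four pairs built only from ATG (M) and TGG (W): 61*61 - 4 = 3717.
PAIRS = [
    (a, b) for a in SENSE for b in SENSE
    if not (CODON_TO_AA[a] in "MW" and CODON_TO_AA[b] in "MW")
]


def pair_frequencies(sequences):
    """Vector of adjacent codon pair frequencies, normalized per amino acid pair."""
    counts = Counter()
    for dna in sequences:
        codons = [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]
        counts.update(zip(codons, codons[1:]))
    totals = Counter()
    for a, b in PAIRS:
        totals[CODON_TO_AA[a], CODON_TO_AA[b]] += counts[a, b]
    freqs = []
    for a, b in PAIRS:
        total = totals[CODON_TO_AA[a], CODON_TO_AA[b]]
        freqs.append(counts[a, b] / total if total else 0.0)
    return freqs


def cosine_distance(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def codon_context_index(highly_expressed_genes, candidate):
    """One minus the cosine distance between the two pair-frequency vectors."""
    f_hcc = pair_frequencies(highly_expressed_genes)
    f_tcc = pair_frequencies([candidate])
    return 1.0 - cosine_distance(f_hcc, f_tcc)


same = codon_context_index(["TTTAAATTCAAG"], "TTTAAATTCAAG")
opposite = codon_context_index(["TTTAAA"], "TTCAAG")
```

Two sequences can share identical synonymous codon frequencies and still differ in this index, since it depends on which codons sit next to each other rather than on how often each is used.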

The frequency of a particular synonymous codon pair in the highly expressed genes or in the candidate nucleic acid sequence, as used in calculating the codon context index, is defined as follows:

3. Outlier index

The outlier index is a measure, calculated by a weighting function, for evaluating the negative effects of a plurality of identified sequence features on protein expression. In some embodiments, the outlier index is defined as follows:

$$\text{Outlier index} = \sum_{i=1}^{N} w_i\, f_i(x)$$

In the above formula, N is the number of identified sequence factors, with N > 1; f_i(x) denotes the penalty scoring function of the i-th sequence factor among the N identified sequence features; and w_i denotes the relative weight given to f_i(x). An optimized gene should therefore have the lowest possible outlier index.
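A weighted sum of per-feature penalties can be sketched as follows, assuming each feature is scored by a function of the sequence alone. The two penalty functions and their weights are invented purely for illustration; the specification leaves the feature list and the weights to the chosen expression system.

```python
def outlier_index(sequence, weighted_penalties):
    """Sum of w_i * f_i(sequence) over (w_i, f_i) pairs."""
    return sum(weight * penalty(sequence) for weight, penalty in weighted_penalties)


# Hypothetical penalty functions, for illustration only.
def restriction_site_penalty(sequence, site="GAATTC"):
    """One penalty point per non-overlapping occurrence of the site."""
    return float(sequence.count(site))


def homopolymer_penalty(sequence, run_length=6):
    """Flat penalty if any base repeats `run_length` times in a row."""
    return 1.0 if any(base * run_length in sequence for base in "ACGT") else 0.0


FEATURES = [
    (2.0, restriction_site_penalty),  # weights are arbitrary example values
    (0.5, homopolymer_penalty),
]

flagged = outlier_index("ATGGAATTCAAAAAA", FEATURES)
clean = outlier_index("ATGGCC", FEATURES)
```

Keeping the features as a list of weight and function pairs makes it straightforward to swap in a different list per host, which matches the statement that unfavorable factors and their weights differ between expression systems.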

In some embodiments, the plurality of sequence factors may be identified through one or more of steps 202, 204, and 208 shown in FIG. 2A. In some embodiments, the plurality of sequence factors include, but are not limited to, GC-content, cis elements, repetitive elements, RNA splicing sites, ribosome binding sequences, and the minimum free energy of mRNA, as detailed below.

3(a). Minimum free energy (MFE) of mRNA

A potentially strong stem-loop secondary structure in the mRNA downstream of the initiation codon can hinder the movement of the ribosome complex, thereby slowing translation and reducing translation efficiency. A stable mRNA secondary structure may also cause the ribosome complex to fall off the mRNA, resulting in premature termination of translation. Several methods exist for free energy calculation and secondary structure prediction, including Mfold, RNAfold and RNAstructure. According to an embodiment of the invention, a local mRNA secondary structure with a low free energy (ΔG < -18 kcal/mol) or a long complementary stem (> 10 bp) is defined as too stable for efficient translation. The gene sequence is preferably optimized so that local structures are not that stable. Both the 5'-UTR and the 3'-UTR of the mRNA are preferably taken into account in the mRNA structure free energy calculation and the secondary structure prediction.

In some embodiments, secondary structures considered too stable are associated with a higher penalty. The weight used to assign the higher penalty score is flexible.

3(b). GC-content

The GC content of the mRNA is also preferably considered. The ideal range for GC% is approximately 30-70%. A high GC-content causes the mRNA to form strong stem-loop secondary structures, and it also causes problems in PCR amplification and gene cloning. A high GC-content in the target sequence is preferably mutated, using codon degeneracy (e.g., during operation of the NSGA-III algorithm, including crossover and mutation of the binary strings), to about 50-60%.

There are two different measures of GC%. One is the global GC%, averaged along the entire sequence; the other, which is more useful, is the local GC%, calculated within a moving "window" of fixed size (e.g., 60 bp). According to an embodiment of the present invention, the local GC% is optimized to about 35-65%.
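
The two GC% measures can be sketched in Python as follows. The function names are hypothetical; the 60 bp window and the 35-65% range are the example values given above, and sliding the window one base at a time is an assumption.

```python
def gc_percent(seq):
    """Global GC%: percentage of G and C bases over the whole sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def local_gc_percent(seq, window=60):
    """Local GC% for every window position (step of 1 bp, an assumption).

    Sequences shorter than the window yield a single global value.
    """
    if len(seq) <= window:
        return [gc_percent(seq)]
    return [gc_percent(seq[i:i + window]) for i in range(len(seq) - window + 1)]


def gc_outlier_windows(seq, window=60, low=35.0, high=65.0):
    """Start positions of windows whose local GC% falls outside [low, high]."""
    return [
        i
        for i, gc in enumerate(local_gc_percent(seq, window))
        if not low <= gc <= high
    ]
```

A sequence can have an acceptable global GC% while still containing several out-of-range windows, which is why the local measure is described as the more useful of the two.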

3(c). Destabilizing factors (e.g., cis-acting mRNA destabilization motifs, RNase splicing sites, repetitive elements, etc.)

To reduce or minimize mRNA degradation, or to increase mRNA stability and thereby reduce mRNA turnover, cis-acting mRNA destabilization motifs, including but not limited to AU-rich elements (AREs) and RNase recognition and cleavage sites, are preferably mutated or deleted from the gene sequence. AU-rich elements (AREs), which have the core motif AUUUA (SEQ ID NO: 1), are generally found in the 3' untranslated region of the mRNA. Another example of an mRNA cis-element consists of the sequence motif TGYYGATGYYYYY (SEQ ID NO: 2), where Y represents T or C. RNase recognition sequences include, but are not limited to, the RNase E recognition sequence. RNase-deficient host strains can also be used for protein expression.
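
Locating such motifs in a candidate sequence can be sketched with regular expressions. The helper below is illustrative only: the name `motif_positions` and the small IUPAC table are assumptions, and the lookahead is used so that overlapping occurrences are all reported.

```python
import re

# Minimal IUPAC table; only the symbols needed for the motifs above plus R and N.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "Y": "[CT]", "R": "[AG]", "N": "[ACGT]",
}


def motif_positions(seq, motif):
    """0-based start positions of every (possibly overlapping) motif match.

    RNA input is handled by treating U as T in both sequence and motif.
    """
    dna = seq.upper().replace("U", "T")
    pattern = "".join(IUPAC[symbol] for symbol in motif.upper())
    return [match.start() for match in re.finditer(f"(?={pattern})", dna)]
```

For example, `motif_positions("AUUUAUUUA", "AUUUA")` reports both overlapping copies of the ARE core motif, at positions 0 and 4.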

RNase splicing sites allow RNA splicing to produce a different mRNA and can thus reduce the level of the original mRNA. RNase splicing sites are therefore also preferably mutated to be non-functional in order to maintain the mRNA level.

To produce high levels of mRNA, an optimal transcriptional promoter sequence is preferably used for the gene sequence. For prokaryotic hosts such as E. coli, one strong promoter is the T7 promoter for T7 RNA polymerase (T7 RNAP). Some bases of long or short tandem simple sequence repeats (SSRs) are preferably mutated, using codon degeneracy, to break the repeats and reduce polymerase slippage, thereby reducing premature proteins or protein mutations.

Additional factors and parameters affect mRNA translation and the resulting protein expression level. These factors influence translation from initiation through termination. The ribosome initiates translation by binding the mRNA at the ribosome binding site (RBS). Because ribosomes do not bind double-stranded RNA, the local mRNA structure around this region is preferably single-stranded and free of any stable secondary structure. AGGAGG (SEQ ID NO: 3), the consensus RBS sequence for prokaryotic cells such as E. coli, also called the Shine-Dalgarno sequence, is preferably located a few bases immediately upstream of the translation initiation site of the gene to be expressed. Internal ribosome entry sites (IRES), however, are preferably mutated to prevent ribosome binding and thus avoid non-specific translation initiation.

Detailed descriptions of the above-mentioned factors can be found, for example, in the publication entitled "CIS/TRANSGENE OPTIMIZATION: SYSTEMATIC DISCOVERY OF NOVEL GENE EXPRESSION USING BIOINFORMATICS AND COMPUTATIONAL BIOLOGY APPROACHES" by Saeid Kadkhodaei et al., published in May 2018; the publication entitled "AU-RICH ELEMENTS AND THE CONTROL OF GENE EXPRESSION THROUGH MRNA STABILITY" by Timothy J Gingerich et al., published in July 2014; the publication entitled "ARED-PLUS: AN UPDATED AND EXPANDED DATABASE OF AU-RICH ELEMENT-CONTAINING MRNAS AND PRE-MRNAS" by Tala Bakheet, published in October 2017; the publication entitled "IDENTIFICATION AND CHARACTERIZATION OF A SEQUENCE MOTIF INVOLVED IN NONSENSE-MEDIATED MRNA DECAY" by Shuang Zhang et al., published in 1995; the publication entitled "CORRELATIONS BETWEEN SHINE-DALGARNO SEQUENCES AND GENE FEATURES SUCH AS PREDICTED EXPRESSION LEVELS AND OPERON STRUCTURES" by Jiong Ma et al., published in 2002; and the publication entitled "AN INTERNAL RIBOSOME ENTRY SITE (IRES) MUTANT LIBRARY FOR TUNING EXPRESSION LEVEL OF MULTIPLE GENES IN MAMMALIAN CELLS" by Esther Y.C. Koh et al., published in December 2013, each of which is incorporated herein by reference in its entirety.

For different expression systems, the list of adverse factors may change, and their influence or weights are not the same either. Thus, f_i(x) and its weight can be modified dynamically for different expression systems. For example, once the allowable ranges of GC-content and MFE have been set, the degree to which a value falls out of range incurs a proportional penalty. Likewise, the number of occurrences of a destabilizing factor can be recorded directly as a penalty score.
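
The two kinds of penalty described above (proportional to the out-of-range distance, and equal to an occurrence count) can be sketched as follows. All names are hypothetical, the `scale` divisor is an assumption introduced so that penalties on different units can be made comparable, and combining the terms as a weighted sum is an assumed reading of the "weighting function".

```python
def range_penalty(value, low=None, high=None, scale=1.0):
    """Penalty proportional to how far `value` lies outside [low, high].

    Either bound may be omitted for a one-sided range, e.g. MFE >= -18 kcal/mol.
    """
    if low is not None and value < low:
        return (low - value) / scale
    if high is not None and value > high:
        return (value - high) / scale
    return 0.0


def count_penalty(seq, motifs):
    """Penalty equal to the number of (overlapping) motif occurrences."""
    return sum(
        1
        for motif in motifs
        for i in range(len(seq) - len(motif) + 1)
        if seq.startswith(motif, i)
    )


def outlier_index(weighted_penalties):
    """Weighted sum over (weight, penalty) pairs; an assumed form."""
    return sum(weight * penalty for weight, penalty in weighted_penalties)
```

Because the weights are passed in rather than fixed, the same functions can be reused with a different factor list and different weights for each expression system.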

It should be appreciated that even if the outlier index of a candidate nucleic acid sequence is high, the candidate may still survive an iteration so that the diversity of the overall population is maintained. In other words, filtering out adverse motifs/features by means of the outlier index is not mandatory, because a higher outlier index (i.e., penalty) merely leads to a lower survival rate. In contrast, removal of adverse motifs/features after the iterations of the NSGA-III algorithm are complete (i.e., in step 110 of FIG. 1 or step 214 of FIG. 2) is mandatory.

In conclusion, the present invention not only seeks to enhance positive effects by maximizing the values of the harmonic index and the codon context index, but also seeks to avoid adverse effects by minimizing the outlier index.

Multi-objective (e.g., more than 2 objectives) optimization algorithm

Because the present invention is an optimization task with three comprehensive objectives, a multi-objective genetic algorithm can be used. In some embodiments, the NSGA-III algorithm or a variant thereof, such as EliteNSGA-III (also proposed by K. Deb), can be used because of its advantage in solving multi-objective optimization problems by maintaining population diversity during the selection operation of the classical genetic algorithm framework.

NSGA-III was proposed in 2014 by Kalyanmoy Deb and Himanshu Jain. It is a reference-point-based multi-objective evolutionary algorithm that follows the NSGA-II framework and emphasizes population members that are non-dominated yet close to a supplied set of reference points. NSGA-III has demonstrated its efficacy on optimization problems with 3 to 15 objectives compared with other genetic algorithms such as NSGA-II. Unlike conventional genetic algorithms, NSGA-III maintains diversity among population members by supplying, and adaptively updating, a number of well-spread predefined reference points; its selection operator therefore differs significantly.

The NSGA-III algorithm is described in the publication entitled "An Evolutionary Many-Objective Optimization Algorithm Using Reference-Point-Based Nondominated Sorting Approach, Part I: Solving Problems With Box Constraints" by Kalyanmoy Deb et al., published in August 2014, which is incorporated herein by reference in its entirety. The related NSGA-II algorithm is described in the publication entitled "A FAST AND ELITIST MULTIOBJECTIVE GENETIC ALGORITHM: NSGA-II" by Kalyanmoy Deb et al., published in August 2002, which is incorporated herein by reference in its entirety.

In the implementation of NSGA-III, a binary string, rather than a codon list/array/vector, is chosen as the data structure representing a nucleic acid sequence, and all the usual operations of a general genetic algorithm, including population initialization, crossover/recombination, and mutation, act on binary strings. This is because a binary string requires less computer memory than a codon list/array/vector and allows faster operations. In some embodiments, 3 consecutive bits are used to represent the codon at one position, because the number of combinations of 3 bits is sufficient to cover every possible synonymous codon of any particular amino acid. For example, 3 bits have 8 combinations (000, 001, 010, 011, 100, 101, 110, 111), a count larger than the number of synonymous codons of any amino acid, even the amino acids L, R and S, which have six synonymous codons each.

Each 3-bit substring thus represents one synonymous codon of a given amino acid. During fitness calculation (e.g., calculation of the harmonic index, codon context index, and outlier index), the binary string representing an individual candidate of the population is converted back into a coding sequence (i.e., DNA). As discussed above, however, the genetic algorithm operations (including crossover, mutation, and selection) all act on binary strings, so the conversion is only temporary. Fitness calculation is therefore sequence-based, whereas all other operations are based on binary strings for efficiency and speed.
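
A minimal sketch of this encoding is given below. The text does not say how a 3-bit value is mapped when an amino acid has fewer than eight synonymous codons; wrapping the value with a modulo, and ordering the synonyms as they appear in the standard codon table, are both assumptions made for illustration, as are the function names.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Synonymous codons per amino acid, in codon-table order (an arbitrary choice).
SYNONYMS = {}
for codon, aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def decode(bits, protein):
    """Convert a binary string (3 bits per residue) into a coding sequence."""
    if len(bits) != 3 * len(protein):
        raise ValueError("expected exactly 3 bits per amino acid")
    codons = []
    for i, aa in enumerate(protein):
        options = SYNONYMS[aa]
        # Assumption: values beyond the synonym count wrap around.
        index = int(bits[3 * i:3 * i + 3], 2) % len(options)
        codons.append(options[index])
    return "".join(codons)


def encode(cds):
    """Convert an in-frame coding sequence into one valid binary string."""
    bits = []
    for i in range(0, len(cds), 3):
        codon = cds[i:i + 3]
        index = SYNONYMS[CODON_TO_AA[codon]].index(codon)
        bits.append(format(index, "03b"))
    return "".join(bits)
```

Under the modulo assumption several bit patterns decode to the same codon, so `encode` returns only one of the possible strings, while `decode(encode(cds), protein)` always reproduces `cds`.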

Before NSGA-III starts, several parameters need to be set, including the population size, the number of divisions, the distribution index for simulated binary crossover, the crossover rate for simulated binary crossover, the mutation rate for bit flip mutation, and the distribution index for bit flip mutation. For many-objective problems, the authors of NSGA-III propose a two-layer approach to the divisions, in which the numbers of outer and inner divisions are specified. To use the two-layer approach, the number of divisions can be replaced by the number of outer divisions and the number of inner divisions. The initialization of all individuals is random, and the crossover and mutation operations do not differ significantly from those of the classical genetic algorithm shown in FIG. 2B.
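
To make the role of the "number of divisions" concrete, the sketch below generates the single-layer structured reference points used by NSGA-III: all points on the unit simplex whose coordinates are multiples of 1/p, where p is the number of divisions. For M objectives this yields C(M + p - 1, p) points. The function name is hypothetical, and the two-layer variant is not shown.

```python
from itertools import combinations
from math import comb


def reference_points(n_objectives, divisions):
    """Evenly spaced points on the unit simplex (single-layer layout).

    Uses a stars-and-bars enumeration: each choice of bar positions splits
    `divisions` units among the objectives.
    """
    slots = divisions + n_objectives - 1
    points = []
    for bars in combinations(range(slots), n_objectives - 1):
        parts = []
        previous = -1
        for bar in bars:
            parts.append(bar - previous - 1)  # units between consecutive bars
            previous = bar
        parts.append(slots - previous - 1)    # units after the last bar
        points.append(tuple(part / divisions for part in parts))
    return points


# For three objectives and, say, 12 divisions (an illustrative value):
assert len(reference_points(3, 12)) == comb(3 + 12 - 1, 12)
```

With the three objectives of this disclosure, the number of reference points, and hence a sensible population size, is governed by the chosen number of divisions.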

FIG. 2B depicts an exemplary general workflow of a genetic algorithm, which involves bio-inspired operators such as crossover, mutation, and selection for population evolution. In the implementation of the present invention, binary strings represent the sequences, so all of the above operators act on binary strings.
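
Two such operators acting on binary strings can be sketched as follows. These are generic textbook versions (single-point crossover and independent bit flips), shown only to illustrate operators working directly on the bit representation; they are not claimed to be the exact operators of the disclosed implementation, and the function names are hypothetical.

```python
import random


def single_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two equal-length bit strings at a random cut point."""
    if len(parent_a) != len(parent_b) or len(parent_a) < 2:
        raise ValueError("parents must have equal length of at least 2")
    cut = rng.randrange(1, len(parent_a))
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b


def bit_flip_mutation(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < rate else bit
        for bit in bits
    )


rng = random.Random(0)  # seeded only to make the sketch reproducible
children = single_point_crossover("000000", "111111", rng)
mutant = bit_flip_mutation(children[0], 1.0 / 6, rng)
```

Because every 3-bit group still decodes to a synonymous codon of the same amino acid, neither operator can change the encoded protein.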

When the fitness functions (i.e., the three index functions presented earlier) must be evaluated for each individual of the whole population before selection, the binary string is temporarily converted back into a codon string. After many generations of evolution, when the evolution terminates, the final codon strings are concatenated and output as the optimal genes to be used for recombinant expression.

In some embodiments, the termination conditions include, but are not limited to: a fixed number of iterations has been reached; the best fitness has been reached and no better results are being produced; some solution satisfies the minimum criteria for a near-optimal solution; or any combination thereof.

According to the teachings of the NSGA-III algorithm, these optimal genes are solutions located on the Pareto surface in the three-dimensional objective space and should be treated equally. For practical purposes, because the resources available for gene synthesis and expression testing are limited, we sort them first by harmonic index in descending order, then by codon context index in descending order, and finally by outlier index in ascending order. The top-ranked sequence can be selected for synthesis and heterologous expression when the quota allows only one sequence. When cost is not strictly constrained, it is recommended to test several candidates that are sufficiently far apart on the Pareto surface, for example the candidate with the highest harmonic index, the candidate with the highest codon context index, and the candidate with the lowest outlier index. In the present invention, the preliminary optimal gene has no stop codon, so two consecutive stop codons can be appended to the 3' end of the coding sequence.
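
The practical ordering described above amounts to a three-key sort, which can be sketched as follows. The `Candidate` type and its field names are hypothetical, and the numbers in the usage note are invented purely to show the tie-breaking.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    harmonic: float  # harmonic index, higher is better
    context: float   # codon context index, higher is better
    outlier: float   # outlier index, lower is better


def rank_candidates(candidates):
    """Sort by harmonic index (descending), then codon context index
    (descending), then outlier index (ascending)."""
    return sorted(candidates, key=lambda c: (-c.harmonic, -c.context, c.outlier))


def spread_picks(candidates):
    """One candidate per objective extreme, for use when several can be tested."""
    return {
        "highest_harmonic": max(candidates, key=lambda c: c.harmonic),
        "highest_context": max(candidates, key=lambda c: c.context),
        "lowest_outlier": min(candidates, key=lambda c: c.outlier),
    }
```

For instance, with made-up scores, a candidate at (0.95, 0.10, 9.0) ranks ahead of one at (0.90, 0.70, 1.0) because the harmonic index is compared first; the later keys only break ties.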

Removal of specific subsequences for molecular cloning

Referring to FIG. 2A, at block 214, the optimization procedure includes steps of motif avoidance and restriction site removal. To make molecular cloning more convenient, certain adverse motifs and restriction sites (e.g., those the customer wishes to avoid) are removed from the one or more optimized sequences before gene synthesis and protein expression. This process comprises the following:

Step 1: Locate all subsequences that must be avoided.

Step 2: List all synonymous codons that can be used for substitution within the subsequence.

Step 3: Synonymous codons used more frequently in the highly expressed genes have higher priority for selection, subject to the condition that no new subsequence to be avoided is created at the same time.

Step 4: Process every subsequence found, iteratively, using Steps 2 and 3.
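
Steps 1 to 4 can be sketched as a small greedy routine. This is an illustrative reading of the steps rather than the disclosed implementation: the function names are hypothetical, `preference` stands in for codon usage frequencies from highly expressed genes (not given in this text), and a substitution is accepted only if it removes at least one forbidden site without creating a new one.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def find_sites(seq, forbidden):
    """Step 1: every (position, motif) at which a forbidden subsequence occurs."""
    return {
        (i, motif)
        for motif in forbidden
        for i in range(len(seq) - len(motif) + 1)
        if seq.startswith(motif, i)
    }


def remove_sites(cds, forbidden, preference):
    """Remove forbidden subsequences by synonymous substitution (Steps 2-4)."""
    seq = cds.upper()
    if len(seq) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    while True:
        sites = find_sites(seq, forbidden)
        if not sites:
            return seq
        pos, motif = min(sites)
        first, last = pos // 3, (pos + len(motif) - 1) // 3
        # Step 2: list synonymous replacements for each codon overlapping the site.
        candidates = []
        for k in range(first, last + 1):
            old = seq[3 * k:3 * k + 3]
            for new, aa in CODON_TO_AA.items():
                if new != old and aa == CODON_TO_AA[old]:
                    candidates.append((preference.get(new, 0.0), k, new))
        # Step 3: try the most preferred codon first; accept it only if the
        # resulting set of sites is a strict subset of the current one.
        for _, k, new in sorted(candidates, key=lambda item: -item[0]):
            trial = seq[:3 * k] + new + seq[3 * k + 3:]
            if find_sites(trial, forbidden) < sites:
                seq = trial
                break
        else:
            raise ValueError(f"no synonymous substitution removes {motif} at {pos}")
        # Step 4: the loop repeats until no forbidden subsequence remains.
```

For example, `remove_sites("ATGGAATTCAAA", ["GAATTC"], {"GAG": 0.6, "TTT": 0.4})` replaces GAA with the preferred synonym GAG and returns `"ATGGAGTTCAAA"`, leaving the encoded protein unchanged.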

In some embodiments, as indicated by blocks 206 and 208, adverse motifs and features are identified separately for different hosts by text mining and literature review.

Exemplary implementation

The exemplary implementation described herein illustrates the effectiveness of the present invention for codon optimization through the optimization and expression of two genes (JNK3A1 and GFP) in the CHO 3E7 cell line; the basic information is summarized below. Because Western blots using an antibody against the Flag tag were performed to evaluate expression levels, a Flag tag was attached to the C-terminus of both proteins, and beta-actin was used as the loading control. Each expression experiment was repeated twice.

mRNA-seq of CHO 3E7 cells cultured in several media, including FreeStyle CHO expression medium and CD CHO medium (Thermo Fisher), was carried out following the classical mRNA-seq protocol recommended by Illumina. Combined with some of our successfully optimized sequences, a total of 500 sequences were defined as the highly expressed genes of the CHO 3E7 cell line. After literature review, the following subsequences were grouped as adverse motifs, whose occurrence incurred a penalty (i.e., an increase in the outlier index). The appropriate local (60 bp sliding window) and global GC-content was about 35-65%, and the minimum allowable MFE ΔG of mRNA secondary structure was -18 kcal/mol; values of these parameters outside those limits incurred a penalty.

1) Splice sites: GGTAAG, GGTGAT

2) AT-rich elements: ATTTTA, ATTTTTA, ATTTTTTA

3) Ribosome binding sites: ACCACCATGG (SEQ ID NO: 4), GCCACCATGG (SEQ ID NO: 5)

4) Antiviral motifs: TGTGT, AACGTT, CGTTCG, AGCGCT, GACGTC, GACGTT

5) CpG island: CGCGCGCG

6) Polymerase slippage sites: GGGGGG, CCCCCC

7) Amyloid precursor protein 3' stability element: TCTCTTTACATTTTGGTCTCTATACTACA (SEQ ID NO: 6)

8) K-box: CTGTGATA

9) Brd-box: AGCTTTA

During codon optimization with NSGA-III, the population size was set to 100; individuals were binary-encoded and randomly generated, with a length equal to three times the number of amino acids in the protein; the number of generations was 200,000; the number of divisions depended on the number of fitness functions; the distribution index for simulated binary crossover was 15.0; the single-point crossover rate for simulated binary crossover was 0.9; the mutation rate for bit flip mutation was 1.0/L; and the distribution index for bit flip mutation was 20.0.

After the harmonic index and codon context index had been maximized while the outlier index was minimized, each protein had several output optimal coding genes, of which only the one with the highest harmonic index was selected for the subsequent expression test. Because the EcoRI and HindIII enzymes were used for vector construction and cloning, GAATTC and AAGCTT were avoided by codon substitution.

The sequence listing submitted herewith as an ASCII text file includes the optimized sequences of the two proteins GFP_Flag (SEQ ID NO: 7) and JNK3_Flag (SEQ ID NO: 8).

The detailed steps of the experiments used to evaluate the performance of the optimized genes relative to the wild type of the same genes are described below.

Step 1: Transient transfection and cell culture

1. The synthesized genes were cloned into the pTT5 vector using the EcoRI and HindIII enzymes. CHO 3E7 cells were cultured in FreeStyle CHO expression medium, and transient transfection of the vector was performed using standard molecular biology techniques at an appropriate cell-to-vector ratio (i.e., a cell density of 1-1.2×10⁶ cells per mL for a vector concentration of 1 µg/mL).

2. After transient transfection, the CHO 3E7 cells were cultured in suspension at 37℃ with 5% CO₂ for 48 hours.

Step 2: Cell disruption

1. Take the cultured cells from the upstream step and centrifuge (10,000 x g) at 4℃ for 2 minutes. Discard the supernatant.

2. Add 1 mL of 1*PBS to resuspend the cells at the bottom of the Eppendorf tube. Then centrifuge (10,000 x g) at 4℃ for 2 minutes and discard the supernatant.

3. Add 200 μL of lysis buffer (hypotonic buffer [10 mM Tris, 1.5 mM MgCl₂, 10 mM KCl, pH 7.9] + 0.5% DDM, PMSF [final concentration 1 mM], nuclease, cocktail) per 1*10⁶ cells to the Eppendorf tube. Resuspend the cells with a pipette.

4. Place the cells in a cup-type ultrasonic cell disruptor for disruption (4℃, 3 seconds of sonication, 1 second interval, 10 minutes in total).

5. After disruption, centrifuge (12,000 x g) at 4℃ for 20 minutes. Collect the supernatant.

Step 3: Sample processing

1. Measure the concentration of the supernatant using the BCA method.

2. Treat a portion of the supernatant with loading buffer.

Step 4: Electrophoresis and Western blot

1. Load the processed samples for SDS-PAGE according to the SOP (8 μg per sample).

2. After electrophoresis, perform the Western blot according to the SOP.

1) Transfer: After SDS-PAGE, remove the gel and transfer the proteins from the gel to a PVDF membrane (transfer buffer: add 200 mL of 5x transfer solution to 150 mL of absolute ethanol and dilute to 1 L; transfer for 1 hour).

2) Blocking: After transfer, block the PVDF membrane with rapid blocking solution for 10 minutes.

3) Incubation: After blocking, incubate for 45 minutes with 5% milk and the corresponding labeled antibody. (Flag tag: mouse anti-Flag mAb, GenScript, catalog number A00187, at a 1:5000 dilution, with addition of THE™ beta Actin Antibody, mAb, Mouse, GenScript, catalog number A00702, at a 1:1000 dilution for 1 hour, followed by addition of the labeled secondary antibody goat anti-mouse IgG-HRP, GenScript, catalog number A00160, at a 1:2500 dilution.)

4) Exposure: After antibody incubation, perform exposure imaging using the ChemiDoc™ Touch Imaging System and save the images to the designated location for editing.

5) Use Image Lab for quantitative protein analysis.

FIG. 3 shows Western blot results illustrating a comparison of expression between the optimized sequences and the wild types of two genes (i.e., GFP and JNK3A1) in the CHO 3E7 cell line according to an embodiment of the present disclosure; for each gene, only the optimized sequence with the highest harmonic index was tested in the expression comparison. The results clearly demonstrate that the present invention is effective for codon optimization and raises expression relative to the internal control beta-actin, which remained nearly unchanged. The leftmost lane is always the ladder marker, and every expression from a single plasmid was repeated twice. According to a rough quantitative analysis, expression of GFP was estimated to have improved approximately 6.2-fold, and expression of JNK3 approximately 2.4-fold, after codon optimization according to the present invention.

Exemplary electronic device

FIG. 4 illustrates an example of a computing device according to one embodiment. Device 400 may be a host computer connected to a network. Device 400 may be a client computer or a server. As shown in FIG. 4, device 400 may be any suitable type of microprocessor-based device, such as a personal computer, workstation, server, or handheld computing device (portable electronic device) such as a phone or tablet. The device may include, for example, one or more of a processor 410, an input device 420, an output device 430, a storage device 440, and a communication device 460. Input device 420 and output device 430 may generally correspond to those described above and may be either connectable to or integrated with the computer.

Input device 420 may be any suitable device that provides input, such as a touch screen, keyboard or keypad, mouse, or voice-recognition device. Output device 430 may be any suitable device that provides output, such as a touch screen, haptic device, or speaker.

Storage device 440 may be any suitable device that provides storage, such as an electrical, magnetic, or optical memory, including RAM, cache, a hard drive, or a removable storage disk. Communication device 460 may include any suitable device capable of transmitting and receiving signals over a network, such as a network interface chip or device. The components of the computer may be connected in any suitable manner, for example via a physical bus or wirelessly.

Software 450, which may be stored in storage device 440 and executed by processor 410, may include, for example, the programming that implements the functionality of the present disclosure (e.g., as implemented in the devices described above).

Software 450 may also be stored and/or transported within any non-transitory computer-readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as those described above, that can fetch the instructions associated with the software and execute them. In the context of the present disclosure, a computer-readable storage medium may be any medium, such as storage device 440, that can contain or store programming for use by or in connection with an instruction execution system, apparatus, or device.

Software 450 may also be propagated within any transport medium for use by or in connection with an instruction execution system, apparatus, or device, such as those described above, that can fetch the instructions associated with the software and execute them. In the context of the present disclosure, a transport medium may be any medium that can communicate, propagate, or transport programming for use by or in connection with an instruction execution system, apparatus, or device. The transport medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, or infrared wired or wireless propagation medium.

Device 400 may be connected to a network, which may be any suitable type of interconnected communication system. The network may implement any suitable communications protocol and may be secured by any suitable security protocol. The network may comprise network links of any suitable arrangement that can implement the transmission and reception of network signals, such as wireless network connections, T1 or T3 lines, cable networks, DSL, or telephone lines.

Device 400 may implement any operating system suitable for operating on the network. Software 450 may be written in any suitable programming language, such as C, C++, Java, or Python. In various embodiments, application software implementing the functionality of the present disclosure may be deployed in different configurations, for example in a client/server arrangement or through a web browser as a web-based application or web service.

Although the disclosure and examples have been fully described with reference to the accompanying drawings, it should be noted that various changes and modifications will be apparent to those skilled in the art. Such changes and modifications are to be understood as being included within the scope of the disclosure and examples as defined by the claims.

For purposes of explanation, the foregoing description has been provided with reference to specific embodiments. However, the illustrative discussion above is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the techniques and their practical applications. Those skilled in the art are thereby enabled to best utilize the techniques and the various embodiments, with various modifications as suited to the particular use contemplated.

SEQUENCE LISTING <110> Nanjingjinsirui Science &Technology Biology Corp. <120> CODON OPTIMIZATION <130> 75989-20004.40 <140> Not Yet Assigned <141> Concurrently Herewith <160> 8 <170> FastSEQ for Windows Version 4.0 <210> 1 <211> 5 <212> RNA <213> Artificial Sequence <220> <223> Synthetic Construct <400> 1 auuua 5 <210> 2 <211> 13 <212> DNA <213> Artificial Sequence <220> <223> Synthetic Construct <400> 2 tgyygatgyy yyy 13 <210> 3 <211> 6 <212> DNA <213> Artificial Sequence <220> <223> Synthetic Construct <400> 3 aggagg 6 <210> 4 <211> 10 <212> DNA <213> Artificial Sequence <220> <223> Synthetic Construct <400> 4 accaccatgg 10 <210> 5 <211> 10 <212> DNA <213> Artificial Sequence <220> <223> Synthetic Construct <400> 5 gccaccatgg 10 <210> 6 <211> 29 <212> DNA <213> Artificial Sequence <220> <223> Synthetic Construct <400> 6 tctctttaca ttttggtctc tatactaca 29 <210> 7 <211> 738 <212> DNA <213> Artificial Sequence <220> <223> Synthetic Construct <400> 7 atgtctaagg gagaagagct gtttaccggc gtggtcccta tcctggtgga gctggacggc 60 gatgtgaacg gccagaaatt cagcgtgtcc ggcgagggcg aaggcgacgc cacctacggc 120 aagctgacac tgaagttcat ctgcaccacc ggcaagctgc ctgtcccttg gccaacactg 180 gtgaccacct tcagctacgg agtgcaatgc ttctccagat accctgacca catgaagcag 240 cacgatttct ttaaatctgc catgcccgag ggctacgtgc aggaacggac catcttctac 300 aaggacgacg gaaattacaa gaccagagcc gaggtgaagt tcgagggcga caccctggtg 360 aaccggatcg agctgaaggg catcgacttc aaagaggatg gcaacatcct gggccacaag 420 atggaataca actacaactc ccacaacgtg tacatcatgg ccgacaagcc taagaacggc 480 atcaaggtga acttcaagat cagacacaac atcaaggacg gctctgtgca gctggccgac 540 cactaccagc agaacacccc catcggcgac ggccctgtgc tgctgcctga taaccactat 600 ctgtctacac agtccgctct gtccaaagat cctaacgaga agcgggacca catgatcctg 660 ctggagttcg tgaccgccgc tggcatcacc catggcatgg acgagctgta caaggactac 720 aaggacgatg acgacaag 738 <210> 8 <211> 1290 <212> DNA <213> Artificial Sequence <220> <223> Synthetic Construct <400> 8 atgtccctgc actttctgta ctactgctct gagcctaccc tggacgtgaa gatcgccttc 60 
tgtcagggct tcgataagca ggtggacgtc tcctatatcg ctaagcacta caacatgagc 120 aaatccaagg tggacaacca gttctactct gtcgaggtgg gcgactctac cttcaccgtg 180 ctgaagagat accagaacct gaaacccatc ggctccggcg ctcagggcat cgtgtgcgcc 240 gcttacgacg ccgtgctgga tagaaacgtg gccatcaaga agctgagccg gcctttccag 300 aaccagacac acgctaagcg ggcctacaga gagctggtcc tgatgaagtg cgtgaaccac 360 aagaacatca tctccctgct gaatgtgttc acccctcaga aaaccctgga agagttccag 420 gatgtgtacc tggtgatgga actgatggac gccaacctgt gccaggtgat ccagatggaa 480 ctggaccacg agcggatgtc ctacctgctg taccagatgc tgtgtggcat caagcacttg 540 catagcgctg gcatcatcca cagagatctg aaaccttcta acatcgtggt gaagtccgac 600 tgcaccctga agatcctgga cttcggcctg gccagaaccg ctggcacctc tttcatgatg 660 acaccctacg tggtgaccag atactaccgg gcccctgaag tgatcctggg catgggctac 720 aaggagaacg tggacatctg gtccgtggga tgcatcatgg gcgagatggt cagacacaag 780 atcctgttcc ccggaagaga ttacatcgac cagtggaaca aggtgatcga gcagctgggc 840 accccttgtc ctgagttcat gaagaaactg cagcctaccg tgcggaacta cgtggaaaac 900 cggcctaagt acgccggcct gacctttcca aagctgttcc ctgactctct gttccccgct 960 gacagcgagc acaacaagct gaaagcctct caggccagag atctgctgtc caagatgctg 1020 gtgatcgacc ctgctaagag aatctccgtg gacgatgccc tgcagcaccc ctacatcaac 1080 gtgtggtacg accctgctga ggtggaagcc cctccacctc agatctacga caagcagctg 1140 gacgaaagag agcacaccat cgaggagtgg aaggagctga tctataaaga agtgatgaac 1200 tccgaggaaa agaccaagaa cggcgtggtc aagggccagc cttccccctc tgctcaggtg 1260 cagcaagact acaaggacga tgatgacaag 1290

Claims

1. A computer-implemented method for optimizing a nucleic acid sequence for protein expression in a host, the method comprising:
a) receiving an initial population set, wherein the initial population set comprises a plurality of initial candidate nucleic acid sequences capable of expressing a protein; and
b) performing, based on the initial population set, optimization of a harmony index, a codon context index, and an outlier index using a computer-assisted NSGA-III algorithm or a variant thereof, thereby obtaining a plurality of optimized nucleic acid sequences capable of expressing the protein,
wherein the harmony index of a candidate nucleic acid sequence represents the consistency of the frequency distribution of synonymous codons between a plurality of highly expressed genes and the candidate nucleic acid sequence,
wherein the codon context index of the candidate nucleic acid sequence is a measure of the placement of synonymous codons in appropriate positions, and
wherein the outlier index of the candidate nucleic acid sequence is a measure of the negative effect of a plurality of predetermined sequence features on the candidate nucleic acid sequence.

2. The method of claim 1, further comprising providing an output representative of at least one optimized nucleic acid sequence of the plurality of optimized nucleic acid sequences.

3. The method of claim 1 or 2, wherein receiving the initial population set comprises:
receiving a protein sequence; and
generating the initial population set based on the received protein sequence.

4. The method of claim 1 or 2, wherein receiving the initial population set comprises:
receiving a nucleic acid sequence;
translating the received nucleic acid sequence into a protein sequence; and
generating the initial population set based on the protein sequence.
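
As an illustration only (it is not part of the claims), the steps of receiving a nucleic acid sequence, translating it, and generating an initial population could be sketched as follows. The function names `translate` and `initial_population`, the uniform random choice among synonymous codons, and the `seed` parameter are assumptions made for this sketch; the source does not specify how the initial candidates are drawn.

```python
import random

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# Reverse map: amino acid -> list of its synonymous codons.
SYNONYMS = {}
for codon, amino_acid in CODON_TABLE.items():
    SYNONYMS.setdefault(amino_acid, []).append(codon)


def translate(dna):
    """Translate a coding DNA sequence into a one-letter protein string."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def initial_population(protein, size, seed=None):
    """Back-translate `protein` into `size` random synonymous DNA candidates.

    Assumption: each codon is drawn uniformly from the synonymous set.
    """
    rng = random.Random(seed)
    return [
        "".join(rng.choice(SYNONYMS[aa]) for aa in protein)
        for _ in range(size)
    ]
```

Every candidate produced this way encodes the same protein, so the population differs only in codon choice, which is the search space the optimization explores.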

5. The method of any one of claims 1 to 4, wherein the initial population set is of a predetermined size.

6. The method of any one of claims 1-5, wherein the initial population set comprises a binary representation of a plurality of initial candidate nucleic acid sequences.

7. The method of any one of claims 1 to 6, wherein performing optimization of the harmony index, the codon context index, and the outlier index comprises:
maximizing the harmony index;
maximizing the codon context index; and
minimizing the outlier index.
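
Because two indices are maximized and one is minimized, a multi-objective optimizer typically needs them in a common direction. The following minimal sketch, which is an assumption about one possible encoding and not taken from the source, negates the two maximized indices so that smaller is better on every axis, and then defines Pareto dominance on the resulting vectors. The names `to_minimization` and `dominates` are hypothetical.

```python
def to_minimization(harmony, codon_context, outlier):
    """Return an objective vector in which smaller is better on every axis."""
    return (-harmony, -codon_context, outlier)


def dominates(a, b):
    """True if vector `a` Pareto-dominates `b`.

    `a` must be no worse than `b` on every objective and strictly better
    on at least one.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better
```

Two candidates that each win on a different index do not dominate one another, which is why the optimization yields a plurality of optimized sequences rather than a single one.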

8. The method of any one of claims 1 to 7, wherein performing optimization of the harmony index, the codon context index, and the outlier index comprises:
calculating, for each initial candidate nucleic acid sequence in the initial population set, a respective harmony index value, a respective codon context index value, and a respective outlier index value;
assigning a plurality of fitness values corresponding to the plurality of initial candidate nucleic acid sequences based on the calculation;
sorting the plurality of initial candidate nucleic acid sequences based on the plurality of fitness values; and
incorporating a subset of the sorted plurality of initial candidate nucleic acid sequences into a subsequent population set.

9. The method of claim 8, further comprising:
generating an offspring population based on the initial population set; and
including the offspring population in the subsequent population set.

10. The method of claim 9, wherein the offspring population is generated through binary tournament selection, crossover/recombination, mutation, or any combination thereof.
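
The three operators named above can be sketched on a binary representation (lists of 0/1 bits). This is a hedged illustration only: single-point crossover is used here as a simpler stand-in for the simulated binary crossover mentioned elsewhere in the claims, and the function names, the `ranks` argument (lower is better), and the default mutation rate are assumptions of this sketch.

```python
import random


def binary_tournament(population, ranks, rng):
    """Draw two distinct individuals at random and keep the better-ranked one."""
    i, j = rng.sample(range(len(population)), 2)
    return population[i] if ranks[i] <= ranks[j] else population[j]


def single_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two equal-length bit lists at a random cut point."""
    point = rng.randrange(1, len(parent_a))
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])


def bit_flip_mutation(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]


def make_offspring(population, ranks, size, mutation_rate=0.01, seed=None):
    """Build `size` children by selection, crossover, then mutation."""
    rng = random.Random(seed)
    offspring = []
    while len(offspring) < size:
        a = binary_tournament(population, ranks, rng)
        b = binary_tournament(population, ranks, rng)
        for child in single_point_crossover(a, b, rng):
            offspring.append(bit_flip_mutation(child, mutation_rate, rng))
    return offspring[:size]
```

In a codon-optimization setting, decoding a child back to DNA must map every bit pattern to a valid synonymous codon so that the encoded protein is preserved; that decoding step is outside this sketch.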

11. The method of any one of claims 8-10, wherein the initial population set and the subsequent population set are of the same size.

12. The method of any one of claims 1 to 11, wherein performing optimization of the harmony index, the codon context index, and the outlier index comprises a plurality of iterations, and wherein an i-th iteration of the plurality of iterations comprises:
receiving a population set of nucleic acid sequences corresponding to the (i-1)-th iteration;
associating each nucleic acid sequence of the population set corresponding to the (i-1)-th iteration with a non-domination level;
sorting the nucleic acid sequences of the population set corresponding to the (i-1)-th iteration based on the associated non-domination levels;
generating a population set corresponding to the i-th iteration, wherein the population set corresponding to the i-th iteration comprises a subset of the sorted nucleic acid sequences corresponding to the (i-1)-th iteration and an offspring population generated based on the sorted nucleic acid sequences corresponding to the (i-1)-th iteration; and
determining, based on one or more termination conditions, whether to proceed to the (i+1)-th iteration using the population set corresponding to the i-th iteration.
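
The step of associating each sequence with a non-domination level can be sketched as below. This is a straightforward, unoptimized version of non-dominated sorting written for illustration; it assumes the objective vectors have already been expressed so that smaller is better on every axis (for example by negating the harmony and codon context indices), and the function names are hypothetical.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` everywhere and strictly better somewhere."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def non_domination_levels(objectives):
    """Return the non-domination level of each vector (0 is the best front).

    Level 0 holds the vectors dominated by no other vector; level 1 holds
    those dominated only by level-0 vectors, and so on.
    """
    remaining = set(range(len(objectives)))
    levels = [None] * len(objectives)
    level = 0
    while remaining:
        front = {
            i for i in remaining
            if not any(dominates(objectives[j], objectives[i])
                       for j in remaining if j != i)
        }
        for i in front:
            levels[i] = level
        remaining -= front
        level += 1
    return levels
```

Sorting the population by these levels and filling the next population front by front gives the "subset of the sorted nucleic acid sequences" that is carried into the next iteration.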

13. The method of claim 12, wherein associating each nucleic acid sequence with a non-domination level comprises calculating, for each nucleic acid sequence of the population set corresponding to the (i-1)-th iteration, a respective harmony index value, a respective codon context index value, and a respective outlier index value.

14. The method of claim 10 or 11, wherein generating the population set corresponding to the i-th iteration comprises:
associating at least one nucleic acid sequence of the sorted nucleic acid sequences corresponding to the (i-1)-th iteration with one of a plurality of predetermined reference points.
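
In NSGA-III as generally described in the literature, reference points are spread evenly on a unit simplex and each solution is associated with the reference direction to which it has the smallest perpendicular distance. The sketch below illustrates that idea under stated assumptions: the objective vectors are taken to be already normalized to non-negative values, the Das-Dennis construction is used for the reference points, and the function names are hypothetical. The source does not say how its reference points are chosen.

```python
import math
from itertools import combinations


def das_dennis(n_objectives, divisions):
    """Evenly spaced reference points on the unit simplex (stars and bars)."""
    total = divisions + n_objectives - 1
    points = []
    for bars in combinations(range(total), n_objectives - 1):
        previous = -1
        coords = []
        # The gap between consecutive bars is the share of each objective.
        for bar in bars + (total,):
            coords.append((bar - previous - 1) / divisions)
            previous = bar
        points.append(tuple(coords))
    return points


def perpendicular_distance(point, direction):
    """Distance from `point` to the line through the origin along `direction`."""
    norm_sq = sum(d * d for d in direction)
    scale = sum(p * d for p, d in zip(point, direction)) / norm_sq
    return math.sqrt(sum((p - scale * d) ** 2 for p, d in zip(point, direction)))


def associate(normalized_objectives, reference_points):
    """Map each objective vector to the index of its closest reference direction."""
    return [
        min(range(len(reference_points)),
            key=lambda r: perpendicular_distance(obj, reference_points[r]))
        for obj in normalized_objectives
    ]
```

With three objectives and four divisions, `das_dennis(3, 4)` yields 15 reference points; the number of divisions is one of the tunable parameters listed later in the claims.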

15. The method of any one of claims 10 to 12, wherein the one or more termination conditions comprise: a fixed number of iterations being reached; a best fitness being reached such that further iterations no longer produce better results; a minimum criterion for a near-optimal solution being satisfied by some solutions; or any combination thereof.
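
The three kinds of termination condition can be combined in a single check, sketched below. All names and default values (`max_iterations`, `patience`, `tolerance`, `target`) are hypothetical, and the sketch tracks a single scalar per iteration (for instance the best harmony index seen, where higher is better) purely to keep the example short.

```python
def should_stop(iteration, best_history, max_iterations=200,
                patience=20, tolerance=1e-6, target=None):
    """Return True if any illustrative termination condition holds.

    best_history: best score recorded at each completed iteration
                  (higher is better in this sketch).
    """
    # Condition 1: a fixed number of iterations has been reached.
    if iteration >= max_iterations:
        return True
    # Condition 2: a minimum criterion for a good-enough solution is met.
    if target is not None and best_history and best_history[-1] >= target:
        return True
    # Condition 3: the best score has stopped improving (plateau).
    if len(best_history) > patience:
        recent_gain = best_history[-1] - best_history[-1 - patience]
        if recent_gain <= tolerance:
            return True
    return False
```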

16. The method of any one of claims 1 to 15, wherein the harmony index of the candidate nucleic acid sequence is calculated based on the formula $H = 1 - D(F_{hs}, F_{ts})$,
wherein D() represents a distance function;
$F_{hs}$ comprises a vector of the frequencies of synonymous codons of a plurality of amino acids in the plurality of highly expressed genes; and
$F_{ts}$ comprises a vector of the frequencies of synonymous codons of the plurality of amino acids within the coding gene of the candidate nucleic acid sequence.

17. The method of claim 16, wherein D() represents a function measuring the distance between two vectors.

18. The method of claim 17, wherein D() is a distance function including, but not limited to, a Euclidean distance, a cosine distance, a Manhattan distance, or a Minkowski distance between the two vectors.
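
A minimal sketch of the harmony index H = 1 - D(F_hs, F_ts) follows. The exact frequency definition is not reproduced in this text, so the sketch assumes that the frequency of a codon is its count divided by the total count of all codons encoding the same amino acid. The function names, the choice of cosine distance as the default, and the exclusion of stop codons from the vector are likewise assumptions.

```python
import math
from collections import Counter

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
# Fixed ordering of the 61 sense codons defines the vector layout.
CODONS = sorted(c for c, aa in CODON_TABLE.items() if aa != "*")


def synonymous_frequencies(sequences):
    """Vector of codon frequencies relative to their synonymous group (assumed)."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        counts.update(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    per_amino_acid = Counter()
    for codon in CODONS:
        per_amino_acid[CODON_TABLE[codon]] += counts[codon]
    return [
        counts[c] / per_amino_acid[CODON_TABLE[c]]
        if per_amino_acid[CODON_TABLE[c]] else 0.0
        for c in CODONS
    ]


def minkowski(u, v, p=2.0):
    """Minkowski distance; p=2 is Euclidean and p=1 is Manhattan."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def cosine_distance(u, v):
    """One minus the cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def harmony_index(reference_genes, candidate, distance=cosine_distance):
    """H = 1 - D(F_hs, F_ts) for a candidate against reference genes."""
    f_hs = synonymous_frequencies(reference_genes)
    f_ts = synonymous_frequencies([candidate])
    return 1.0 - distance(f_hs, f_ts)
```

Note that an unscaled Euclidean or Manhattan distance between these vectors can exceed 1, which would make H negative; a practical implementation would presumably normalize the distance, but the source text does not say how.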

19. The method of claim 18, wherein the frequency of synonymous codons in the plurality of highly expressed genes or in the candidate nucleic acid sequence is defined as follows:

20. The method of any one of claims 1 to 19, wherein the codon context index of the candidate nucleic acid sequence is calculated based on the formula $CC = 1 - D(F_{hcc}, F_{tcc})$,
wherein D() represents a distance function;
$F_{hcc}$ comprises a vector of the frequencies of synonymous codon pairs of two consecutive amino acids in the plurality of highly expressed genes; and
$F_{tcc}$ comprises a vector of the frequencies of synonymous codon pairs of two consecutive amino acids within the coding gene of the candidate nucleic acid sequence.
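
The codon context index CC = 1 - D(F_hcc, F_tcc) can be sketched in the same spirit. Since the codon-pair frequency definition is not reproduced in this text, the sketch assumes that the frequency of a codon pair is its count divided by the count of all codon pairs encoding the same pair of consecutive amino acids. Sparse dictionaries are used because there are several thousand possible codon pairs. Function names and the use of cosine distance are assumptions, and input sequences are assumed to contain only A, C, G, and T.

```python
import math
from collections import Counter

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def codon_pair_frequencies(sequences):
    """Sparse map: codon pair -> frequency within its amino-acid pair (assumed)."""
    pair_counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
        pair_counts.update(zip(codons, codons[1:]))
    totals = Counter()
    for (first, second), n in pair_counts.items():
        totals[CODON_TABLE[first], CODON_TABLE[second]] += n
    return {
        pair: n / totals[CODON_TABLE[pair[0]], CODON_TABLE[pair[1]]]
        for pair, n in pair_counts.items()
    }


def cosine_distance(u, v):
    """Cosine distance between two sparse vectors stored as dicts."""
    dot = sum(value * v.get(key, 0.0) for key, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def codon_context_index(reference_genes, candidate):
    """CC = 1 - D(F_hcc, F_tcc) using cosine distance."""
    f_hcc = codon_pair_frequencies(reference_genes)
    f_tcc = codon_pair_frequencies([candidate])
    return 1.0 - cosine_distance(f_hcc, f_tcc)
```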

21. The method of claim 20, wherein D() represents a function measuring the distance between two vectors.

22. The method of claim 21, wherein D() is a distance function including, but not limited to, a Euclidean distance, a cosine distance, a Manhattan distance, or a Minkowski distance between the two vectors.

23. The method of any one of claims 20-22, wherein the frequencies of synonymous codon pairs in the plurality of highly expressed genes or in the candidate nucleic acid sequence are defined as follows:

24. The method of any one of claims 1 to 23, wherein the outlier index is calculated based on an equation (not reproduced in this text) in which:
N is the number of the plurality of predetermined sequence features;
$f_i(x)$ represents the penalty scoring function of the i-th sequence feature among the plurality of predetermined sequence features; and
$w_i$ denotes a relative weight associated with $f_i(x)$.

25. The method of claim 24, wherein the plurality of predetermined sequence features comprises:
a GC-content value,
a cis-acting element,
a repetitive element,
an RNA splicing site,
a ribosome binding sequence,
a minimum free energy of the mRNA, or
any combination thereof.
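
Because the outlier index equation itself is not reproduced in this text, the sketch below assumes the simplest form consistent with the listed symbols: a weighted sum of the N penalty functions, O(x) = w_1 f_1(x) + ... + w_N f_N(x). That form is an assumption, not the claimed equation. The two penalty functions, their thresholds, and the example motifs (AGGAGG and ATTTA, which correspond to SEQ ID NO: 3 and the DNA form of SEQ ID NO: 1 in the sequence listing) are chosen only for illustration.

```python
def gc_content_penalty(seq, low=0.40, high=0.60):
    """Penalty for GC content outside a hypothetical [low, high] window."""
    seq = seq.upper()
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    if gc < low:
        return low - gc
    if gc > high:
        return gc - high
    return 0.0


def motif_penalty(seq, motifs=("AGGAGG", "ATTTA")):
    """Count non-overlapping occurrences of unwanted motifs."""
    seq = seq.upper()
    return float(sum(seq.count(m) for m in motifs))


def outlier_index(seq, features):
    """Weighted sum of penalties (assumed form of the outlier index).

    features: list of (weight, penalty_function) pairs, one per feature.
    """
    return sum(weight * penalty(seq) for weight, penalty in features)
```

A call such as `outlier_index(seq, [(1.0, gc_content_penalty), (2.0, motif_penalty)])` returns 0.0 for a sequence with no flagged features and grows as more features are violated, which matches the stated goal of minimizing the index.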

26. The method of claim 24, wherein the plurality of predetermined sequence features is identified based on a selected expression system.

27. The method of any one of claims 1-26, wherein the variant of the NSGA-III algorithm comprises an EliteNSGA-III algorithm or an NSGA-II based immune algorithm.

28. The method of any one of claims 1-27, wherein performing optimization of the harmony index, the codon context index, and the outlier index comprises:
ranking the plurality of optimized nucleic acid sequences by descending order of harmony index, then descending order of codon context index, and then ascending order of outlier index; and
selecting one or more top-ranked optimized nucleic acid sequences for synthesis.
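
The three-level ranking described above maps directly onto a composite sort key. In this sketch, each candidate is assumed to be a dictionary with `harmony`, `codon_context`, and `outlier` fields; the field names and the helper names are hypothetical.

```python
def rank_candidates(candidates):
    """Sort by harmony (descending), codon context (descending), outlier (ascending)."""
    return sorted(
        candidates,
        key=lambda c: (-c["harmony"], -c["codon_context"], c["outlier"]),
    )


def select_top(candidates, count=1):
    """Return the `count` best-ranked candidates for synthesis."""
    return rank_candidates(candidates)[:count]
```

Negating the two maximized indices lets a single ascending sort express all three orderings, with later keys used only to break ties in earlier ones.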

29. The method of any one of claims 1-28, further comprising:
c) removing a predetermined adverse subsequence or motif from an optimized nucleic acid sequence of the plurality of optimized nucleic acid sequences.

30. The method of claim 29, wherein the predetermined adverse subsequence or motif is identified based on analysis of a plurality of text portions.

31. The method of claim 29, wherein removing the predetermined adverse subsequence or motif comprises:
identifying the predetermined adverse subsequence or motif in the optimized nucleic acid sequence;
identifying a plurality of synonymous codons based on the identified predetermined adverse subsequence or motif; and
selecting a synonymous codon from the plurality of synonymous codons for substitution into the identified predetermined adverse subsequence in the optimized nucleic acid sequence.
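
One way to picture the removal step is a greedy synonymous substitution: locate the motif, enumerate the synonymous alternatives of each codon it overlaps, and accept the first swap that reduces the number of motif occurrences. This is a hedged sketch, not the patented procedure; the function name `remove_motif`, the greedy acceptance rule, and the choice to leave a motif in place when no synonymous swap removes it are all assumptions. The input is assumed to be an in-frame coding sequence of A, C, G, and T.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
SYNONYMS = {}
for codon, amino_acid in CODON_TABLE.items():
    SYNONYMS.setdefault(amino_acid, []).append(codon)


def translate(dna):
    """Translate an in-frame coding sequence into protein."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def remove_motif(dna, motif):
    """Greedily remove `motif` using synonymous codon swaps only."""
    dna, motif = dna.upper(), motif.upper()
    position = dna.find(motif)
    while position != -1:
        first = position // 3                      # first codon touched
        last = (position + len(motif) - 1) // 3    # last codon touched
        fixed = False
        for index in range(first, last + 1):
            start = 3 * index
            current = dna[start:start + 3]
            for alternative in SYNONYMS[CODON_TABLE[current]]:
                trial = dna[:start] + alternative + dna[start + 3:]
                # Accept only swaps that strictly reduce the motif count.
                if alternative != current and trial.count(motif) < dna.count(motif):
                    dna, fixed = trial, True
                    break
            if fixed:
                break
        if not fixed:
            break  # no synonymous swap helps; leave the motif in place
        position = dna.find(motif)
    return dna
```

Because each accepted swap strictly lowers the motif count, the loop terminates, and because only synonymous codons are substituted, the encoded protein is unchanged.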

32. The method of any one of claims 1-31, wherein at least one of the harmony index, the codon context index, and the outlier index is calculated based on one or more characteristics of a plurality of highly expressed genes from one or more databases.

33. The method of claim 32, wherein the one or more characteristics comprise codon frequency, synonymous codon frequency, codon pair frequency, or a combination thereof.

34. The method of any one of claims 1-33, further comprising setting one or more parameters, wherein the one or more parameters comprise a size of the population set, a number of divisions, a distribution index for simulated binary crossover, a crossover rate for simulated binary crossover, a mutation rate for bit-flip mutation, a distribution index for bit-flip mutation, or any combination thereof.
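
The listed parameters could be grouped into a single validated configuration object, as sketched below. The class name, field names, and every default value are hypothetical placeholders chosen for this illustration; the source does not state recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OptimizerParameters:
    """Tunable settings for the optimization run (all defaults hypothetical)."""
    population_size: int = 100
    divisions: int = 12                       # reference-point divisions
    crossover_rate: float = 0.9               # simulated binary crossover
    crossover_distribution_index: float = 30.0
    mutation_rate: float = 0.01               # bit-flip mutation
    mutation_distribution_index: float = 20.0

    def __post_init__(self):
        if self.population_size <= 0 or self.divisions <= 0:
            raise ValueError("population_size and divisions must be positive")
        for name in ("crossover_rate", "mutation_rate"):
            if not 0.0 <= getattr(self, name) <= 1.0:
                raise ValueError(f"{name} must lie in [0, 1]")
```

Freezing the dataclass keeps a run's settings immutable once validated, which makes results easier to reproduce and compare.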

35. A non-transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions which, when executed by one or more processors of an electronic device, cause the electronic device to perform the method of any one of claims 1 to 34.

36. A system for optimizing a nucleic acid sequence for protein expression in a host, the system comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing the method of any one of claims 1 to 34.

37. An electronic device for optimizing a nucleic acid sequence for protein expression in a host, the device comprising means for performing the method of claim 1.

38. A program product stored in a recordable medium for optimizing a nucleic acid sequence for protein expression in a host, the program product comprising computer software for performing the method of claim 1.

39. An isolated nucleic acid molecule comprising an optimized nucleic acid sequence obtained from the method of any one of claims 1 to 34.

40. A vector comprising the isolated nucleic acid molecule of claim 39.

41. A recombinant host cell comprising the isolated nucleic acid molecule of claim 39 or the vector of claim 40.

42. A method of expressing a protein in a host cell, the method comprising:
(a) obtaining an optimized nucleic acid sequence for protein expression in the host cell using the method of any one of claims 1 to 34;
(b) synthesizing a nucleic acid molecule comprising the optimized nucleic acid sequence;
(c) introducing the nucleic acid molecule into the host cell to obtain a recombinant host cell; and
(d) culturing the recombinant host cell under conditions that permit expression of the protein from the optimized nucleic acid sequence.